Building a free, general-domain paraphrase database for Japanese

Author

Mizukami, Masahiro ; Neubig, Graham ; Sakti, Sakriani ; Toda, Tomoki ; Nakamura, Satoshi

Author_Institution

Nara Inst. of Sci. & Technol., Ikoma, Japan

fYear

2014

Firstpage

1

Lastpage

4

Abstract

Previous work has used parallel corpora and alignment techniques from phrase-based statistical machine translation to extract and generate paraphrases. For Japanese, this method has been used to extract paraphrases for a number of paraphrase categories and domains. However, most of these resources focus on a particular phenomenon in Japanese, and there is still no freely available Japanese paraphrase resource that covers all varieties of phrases across several domains. In addition, because Japanese and English differ in grammar and word order, we perform syntax-based preprocessing to reduce this mismatch and extract paraphrases similar in quality to those extracted from more similar language pairs. The data used to create the Japanese paraphrases is either in the public domain or available under a Creative Commons license, and it spans a variety of genres for wide coverage.

Keywords

audio databases; computational linguistics; grammars; language translation; natural language processing; statistical analysis; Creative Commons license; English; Japanese; general-domain paraphrase database; grammar; parallel corpora; phrase-based statistical machine translation; syntax-based preprocessing; word ordering; Grammar; Licenses; Free Data; General-Domain; Paraphrasing

fLanguage

English

Publisher

IEEE

Conference_Titel

2014 17th Oriental Chapter of the International Committee for the Co-ordination and Standardization of Speech Databases and Assessment Techniques (COCOSDA)

Type

conf

DOI

10.1109/ICSDA.2014.7051433

Filename

7051433