DocumentCode
3570557
Title
Building a free, general-domain paraphrase database for Japanese
Author
Mizukami, Masahiro ; Neubig, Graham ; Sakti, Sakriani ; Toda, Tomoki ; Nakamura, Satoshi
Author_Institution
Nara Inst. of Sci. & Technol., Ikoma, Japan
fYear
2014
Firstpage
1
Lastpage
4
Abstract
Previous work has used parallel corpora and alignment techniques from phrase-based statistical machine translation to extract and generate paraphrases. For Japanese, this method has been used to extract paraphrases for a number of paraphrase categories or domains. However, most of these resources focus on a particular phenomenon in Japanese, and there is still no freely available Japanese paraphrase resource that covers all varieties of phrases across several domains. In addition, because Japanese and English differ in grammar and word order, we perform syntax-based preprocessing to reduce this mismatch and extract paraphrases similar in quality to those extracted using more similar language pairs. The data used to create the Japanese paraphrases is either in the public domain or available under a Creative Commons license, and spans a variety of genres for wide coverage.
Keywords
audio databases; computational linguistics; grammars; language translation; natural language processing; statistical analysis; Creative Commons license; English; Japanese; general-domain paraphrase database; grammar; parallel corpora; phrase-based statistical machine translation; syntax-based preprocessing; word ordering; Grammar; Licenses; Free Data; General-Domain; Paraphrasing;
fLanguage
English
Publisher
IEEE
Conference_Titel
2014 17th Oriental Chapter of the International Committee for the Co-ordination and Standardization of Speech Databases and Assessment Techniques (COCOSDA)
Type
conf
DOI
10.1109/ICSDA.2014.7051433
Filename
7051433