DocumentCode :
3570557
Title :
Building a free, general-domain paraphrase database for Japanese
Author :
Mizukami, Masahiro ; Neubig, Graham ; Sakti, Sakriani ; Toda, Tomoki ; Nakamura, Satoshi
Author_Institution :
Nara Inst. of Sci. & Technol., Ikoma, Japan
fYear :
2014
Firstpage :
1
Lastpage :
4
Abstract :
Previous works have used parallel corpora and alignment techniques from phrase-based statistical machine translation to extract and generate paraphrases. In Japanese, paraphrases for a number of paraphrase categories or domains have been extracted by this method. However, most of these resources focus on a particular phenomenon in Japanese, and there are still no Japanese paraphrase resources that cover all varieties of phrases from several domains, and are freely available. In addition, because Japanese and English vary in grammar and word ordering, we perform syntax-based preprocessing to reduce this mismatch and extract paraphrases similar in quality to those extracted using more similar language pairs. The data used in creating the Japanese paraphrases is either in the public domain, or available under the Creative Commons license, and spans a variety of genres for wide coverage.
Keywords :
audio databases; computational linguistics; grammars; language translation; natural language processing; statistical analysis; Creative Commons license; English; Japanese; general-domain paraphrase database; grammar; parallel corpora; phrase-based statistical machine translation; syntax-based preprocessing; word ordering; Grammar; Licenses; Free Data; General-Domain; Paraphrasing;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
2014 17th Oriental Chapter of the International Committee for the Co-ordination and Standardization of Speech Databases and Assessment Techniques (COCOSDA)
Type :
conf
DOI :
10.1109/ICSDA.2014.7051433
Filename :
7051433