• DocumentCode
    3570557
  • Title
    Building a free, general-domain paraphrase database for Japanese
  • Author
    Mizukami, Masahiro ; Neubig, Graham ; Sakti, Sakriani ; Toda, Tomoki ; Nakamura, Satoshi
  • Author_Institution
    Nara Inst. of Sci. & Technol., Ikoma, Japan
  • fYear
    2014
  • Firstpage
    1
  • Lastpage
    4
  • Abstract
    Previous work has used parallel corpora and alignment techniques from phrase-based statistical machine translation to extract and generate paraphrases. In Japanese, paraphrases for a number of paraphrase categories or domains have been extracted by this method. However, most of these resources focus on a particular phenomenon in Japanese, and there are still no freely available Japanese paraphrase resources that cover all varieties of phrases across several domains. In addition, because Japanese and English differ in grammar and word order, we perform syntax-based preprocessing to reduce this mismatch and extract paraphrases similar in quality to those extracted from more similar language pairs. The data used to create the Japanese paraphrases is either in the public domain or available under a Creative Commons license, and spans a variety of genres for wide coverage.
  • Keywords
    audio databases; computational linguistics; grammars; language translation; natural language processing; statistical analysis; Creative Commons license; English; Japanese; general-domain paraphrase database; grammar; parallel corpora; phrase-based statistical machine translation; syntax-based preprocessing; word ordering; Grammar; Licenses; Free Data; General-Domain; Paraphrasing;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    2014 17th Oriental Chapter of the International Committee for the Co-ordination and Standardization of Speech Databases and Assessment Techniques (COCOSDA)
  • Type
    conf
  • DOI
    10.1109/ICSDA.2014.7051433
  • Filename
    7051433