• DocumentCode
    2873914
  • Title
    A Research on Multi-feature Word-Level Paraphrase Extracting System Based on Context
  • Author
    He Xian-Jiang ; Yu Zhong-hua
  • Author_Institution
    Coll. of Comput. Sci., Sichuan Univ., Chengdu, China
  • fYear
    2012
  • fDate
    2-4 Nov. 2012
  • Firstpage
    441
  • Lastpage
    444
  • Abstract
    The essence of paraphrasing lies in retrieving correct paraphrases. Word-level paraphrasing is sensitive to context, and its critical indicator is interchangeability. This paper presents a two-stage, multi-feature method for extracting word-level Chinese paraphrases. In stage one, data mining techniques are used to extract the target word and its candidate paraphrases from large corpora and the Internet. In stage two, a stratified probabilistic statistical model is established, and seven similarity feature values are calculated and then used to train a binary classifier. Finally, the candidate paraphrases with high similarity values are selected. Experimental results show that (1) retrieving candidate paraphrases from large corpora through data mining has practical value, yielding on average 3.1 correct paraphrases per word; (2) the binary classifier is effective in selecting the correct paraphrases, with an accuracy of 0.676; and (3) 34.32% of the retrieved paraphrases cannot be found in the Chinese Expanded Synonym Dictionary, which shows that the paraphrase retrieving method presented in this paper extends traditional paraphrase extracting methods.
  • Keywords
    data mining; dictionaries; feature extraction; information filtering; natural language processing; pattern classification; probability; statistical analysis; text analysis; Chinese expanded synonym dictionary; binary classifier training; context; data mining technology; interchangeability; large-size corpuses; multifeature word-level paraphrase extracting system; paraphrase retrieving method; paraphrases filtering; similarity feature values; stratified probability statistical model; target word; two-stage multifeature word-level Chinese paraphrase extracting method; Accuracy; Context; Data mining; Feature extraction; Semantics; Testing; Training; binary classifier; corpuses; multi-feature; paraphrase;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Multimedia Information Networking and Security (MINES), 2012 Fourth International Conference on
  • Conference_Location
    Nanjing
  • Print_ISBN
    978-1-4673-3093-0
  • Type
    conf
  • DOI
    10.1109/MINES.2012.43
  • Filename
    6405718