• DocumentCode
    124204
  • Title
    Semantic Similarity Measurements for Multi-lingual Short Texts Using Wikipedia
  • Author
    Nakamura, T. ; Shirakawa, Masumi ; Hara, Tenshi ; Nishio, Shojiro
  • Author_Institution
    Dept. of Multimedia Eng., Osaka Univ., Suita, Japan
  • Volume
    2
  • fYear
    2014
  • fDate
    11-14 Aug. 2014
  • Firstpage
    22
  • Lastpage
    29
  • Abstract
    In this paper, we propose two methods to measure the semantic similarity of multi-lingual short texts using Wikipedia. In recent years, people around the world have been continuously generating information about their local area in their own languages on social networking services. Measuring the similarity between these texts is challenging because they are often short and written in various languages. Our methods solve this problem by incorporating the inter-language links of Wikipedia into extended naive Bayes (ENB), a probabilistic method for measuring the semantic similarity of short texts. The proposed methods represent a multi-lingual short text as a vector of English Wikipedia articles (entities). We conducted an experiment on clustering tweets written in four languages (English, Spanish, Japanese and Arabic). The experimental results confirmed that our methods outperformed cross-lingual explicit semantic analysis (CL-ESA), a method for measuring the similarity between texts written in two different languages. Moreover, our methods were competitive with ENB applied to texts that had been translated into English using Google Translate. Our methods thus enable similarity measurements for multi-lingual short texts without the cost of machine translation.
  • Keywords
    Bayes methods; natural language processing; social networking (online); text analysis; Arabic language; CL-ESA; ENB; English language; English version; Google Translate; Japanese language; Spanish language; Wikipedia articles; Wikipedia entities; cross-lingual explicit semantic analysis; extended naive Bayes; interlanguage links; multilingual short text; probabilistic method; semantic similarity measurements; social networking services; tweets clustering; vector; Electronic publishing; Encyclopedias; Internet; Probabilistic logic; Semantics; Vectors;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Web Intelligence (WI) and Intelligent Agent Technologies (IAT), 2014 IEEE/WIC/ACM International Joint Conferences on
  • Conference_Location
    Warsaw
  • Type
    conf
  • DOI
    10.1109/WI-IAT.2014.76
  • Filename
    6927603
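
The abstract describes representing a short text in any language as a vector of English Wikipedia articles (entities) reached through inter-language links. The sketch below illustrates only that representation step, under loud assumptions: the lookup tables `TERM_TO_ARTICLE` and `INTERLANGUAGE_LINK` are hypothetical toy data, the function names are invented for illustration, and plain cosine similarity over entity counts stands in for the paper's extended naive Bayes (ENB) scoring, which is not reproduced here.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy tables; a real system would derive them from Wikipedia data.
# (language, surface term) -> article title in that language's Wikipedia
TERM_TO_ARTICLE = {
    ("en", "earthquake"): "Earthquake",
    ("es", "terremoto"): "Terremoto",
    ("ja", "地震"): "地震",
    ("en", "football"): "Association football",
    ("es", "fútbol"): "Fútbol",
}
# (language, article title) -> English article title via an inter-language link
INTERLANGUAGE_LINK = {
    ("es", "Terremoto"): "Earthquake",
    ("ja", "地震"): "Earthquake",
    ("es", "Fútbol"): "Association football",
}


def to_english_entities(lang, tokens):
    """Map tokens of a short text to a count vector of English entities."""
    vec = Counter()
    for tok in tokens:
        article = TERM_TO_ARTICLE.get((lang, tok.lower()))
        if article is None:
            continue  # token does not match any known article
        if lang == "en":
            entity = article
        else:
            entity = INTERLANGUAGE_LINK.get((lang, article))
        if entity is not None:
            vec[entity] += 1
    return vec


def cosine(a, b):
    """Cosine similarity of two sparse count vectors (a stand-in for ENB)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


spanish = to_english_entities("es", ["terremoto", "hoy"])
japanese = to_english_entities("ja", ["地震"])
english = to_english_entities("en", ["football"])
print(cosine(spanish, japanese), cosine(spanish, english))
```

Because every text lands in the same English entity space, texts in any pair of languages can be compared directly, with no machine translation step.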