  • DocumentCode
    3433182
  • Title
    Mining the multilingual terminology from the web
  • Author
    Sadat, F.
  • Author_Institution
    Dept. of Comput. Sci., Univ. of Quebec in Montreal (UQAM), Montreal, QC, Canada
  • fYear
    2013
  • fDate
    27-29 Aug. 2013
  • Firstpage
    41
  • Lastpage
    45
  • Abstract
    Multilingual linguistic resources are usually constructed from parallel corpora, but since these corpora are available only for selected text domains and language pairs, the potential of other resources is being explored as well. This article explores the idea of using multilingual Web-based encyclopaedias such as Wikipedia as comparable corpora for multilingual terminology extraction. We propose an approach to extract terms and their translations from different types of Wikipedia link information and texts. The next step will use linguistic information to re-rank and filter the extracted term candidates in the target language. Preliminary evaluations of the combined statistics-based and linguistics-based approaches were carried out on Japanese, French, and English. These evaluations showed a clear improvement and good quality of the extracted term candidates, which can be used to build or enrich multilingual ontologies and dictionaries, or to feed a cross-language information retrieval system with expansion terms related to the source query.
  • Keywords
    Internet; Web sites; data mining; dictionaries; encyclopaedias; information filtering; information retrieval systems; ontologies (artificial intelligence); text analysis; Japanese-French-English languages; Web; Wikipedia link information; combined statistics-based-linguistic-based approaches; cross-language information retrieval system; language pairs; multilingual Web-based encyclopaedias; multilingual linguistic resources; multilingual ontologies; multilingual terminology extraction; multilingual terminology mining; parallel corpora; source query; text domains; Electronic publishing; Encyclopedias; Ontologies; Terminology; Web application; lexical database; natural language processing
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Communications, Computers and Signal Processing (PACRIM), 2013 IEEE Pacific Rim Conference on
  • Conference_Location
    Victoria, BC
  • ISSN
    1555-5798
  • Type
    conf
  • DOI
    10.1109/PACRIM.2013.6625446
  • Filename
    6625446