• DocumentCode
    600219
  • Title

    Automatically Mining Parallel Corpora for Minority Languages from Web Pages

  • Author

    Zede Zhu ; Miao Li ; Lei Chen ; Weihui Zeng

  • Author_Institution
    Inst. of Intell. Machines, Hefei, China
  • fYear
    2012
  • fDate
    13-15 Nov. 2012
  • Firstpage
    97
  • Lastpage
    100
  • Abstract
    Parallel corpora are indispensable resources for a variety of multilingual natural language processing tasks. This paper describes a system that automatically mines parallel corpora from web pages, with the aim of overcoming the shortage of parallel corpora for minority languages. Building on existing techniques for mining bilingual corpora from the web and on the characteristics of bilingual websites in minority languages, it proposes a method for mining parallel corpora in minority languages based on heuristic information extracted from page content. Experiments on the Chinese-Mongolian language pair show that the system succeeds in automatically identifying a significant amount of parallel text on the World Wide Web.
  • Keywords
    Web sites; data mining; linguistics; natural language processing; text analysis; Chinese-Mongolian language pair; Web bilingual corpora mining; Web pages; World Wide Web; automatic parallel corpora mining; automatic parallel text identification; bilingual Web sites; heuristic information extraction; minority languages; multilingual natural language processing; Feature extraction; HTML; Support vector machines; extracting content; identifying parallel pairs; parallel corpora; web mining
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Asian Language Processing (IALP), 2012 International Conference on
  • Conference_Location
    Hanoi
  • Print_ISBN
    978-1-4673-6113-2
  • Electronic_ISBN
    978-0-7695-4886-9
  • Type

    conf

  • DOI
    10.1109/IALP.2012.29
  • Filename
    6473705