• DocumentCode
    533642
  • Title
    Extracting Parallel Texts from the Web
  • Author
    Le Quang Hung; Le Anh Cuong
  • Author_Institution
    Faculty of Information Technology, Quynhon University, Vietnam
  • fYear
    2010
  • fDate
    7-9 Oct. 2010
  • Firstpage
    147
  • Lastpage
    151
  • Abstract
    A parallel corpus is a valuable resource for important natural language processing applications such as statistical machine translation, dictionary construction, and cross-language information retrieval. The Web is a huge knowledge resource, part of which is bilingual information spread across various kinds of web pages, and it currently attracts many studies on building parallel corpora from Internet resources. However, obtaining a parallel corpus with high accuracy is still a challenge. This paper focuses on extracting parallel texts from bilingual web sites for the English-Vietnamese language pair. We first propose a new way of designing content-based features, and then combine them with structural features in a machine learning framework. In our experiment, we obtain a precision of 88.2% for the extracted parallel texts.
  • Keywords
    Web services; Web sites; content-based retrieval; learning (artificial intelligence); natural language processing; text analysis; English language; Internet resource; Vietnamese language; Web pages; bilingual Web-sites; bilingual information; content-based features; knowledge resource; machine learning; parallel corpora; parallel texts; Data mining; Dictionaries; Feature extraction; HTML; Support vector machines
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Knowledge and Systems Engineering (KSE), 2010 Second International Conference on
  • Conference_Location
    Hanoi
  • Print_ISBN
    978-1-4244-8334-1
  • Type
    conf
  • DOI
    10.1109/KSE.2010.14
  • Filename
    5632135