• DocumentCode
    2753681
  • Title
    Automatic Construction of English-Vietnamese Parallel Corpus through Web Mining
  • Author
    Dang, Van B. ; Ho, Bao-Quoc
  • Author_Institution
    Fac. of Inf. Technol., Univ. of Natural Sci., Ho Chi Minh City
  • fYear
    2007
  • fDate
    5-9 March 2007
  • Firstpage
    261
  • Lastpage
    266
  • Abstract
    Parallel corpora have become an essential resource for multilingual natural language processing, and a large volume of parallel text is now available on the Internet. In this paper, we propose a simple but reliable method to construct an English-Vietnamese parallel corpus through Web mining. Our system can automatically download and detect parallel Web pages on a given domain to construct a parallel corpus that is well aligned at the paragraph level, with completely clean texts. The proposed technique can easily be applied to other language pairs. Experiments have been conducted and show promising results.
  • Keywords
    Internet; data mining; natural language processing; Web mining; automatic English-Vietnamese parallel corpus construction; multilingual natural language processing; parallel Web pages; Detectors; Dictionaries; Filtering; Information retrieval; Information technology; Large-scale systems; Uniform resource locators; parallel corpus
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Research, Innovation and Vision for the Future, 2007 IEEE International Conference on
  • Conference_Location
    Hanoi
  • Print_ISBN
    1-4244-0694-3
  • Type
    conf
  • DOI
    10.1109/RIVF.2007.369166
  • Filename
    4223083