• DocumentCode
    2248250
  • Title
    Approaches to improving corpus quality for statistical machine translation
  • Author
    Liu, Peng ; Zhou, Yu ; Zong, Cheng-qing
  • Author_Institution
    Nat. Lab. of Pattern Recognition, Chinese Acad. of Sci., Beijing, China
  • Volume
    6
  • fYear
    2010
  • fDate
    11-14 July 2010
  • Firstpage
    3293
  • Lastpage
    3298
  • Abstract
    The performance of a statistical machine translation (SMT) system depends heavily on the quantity and quality of its bilingual language resources. However, previous work has mainly focused on quantity, trying to collect more bilingual data. In this paper, we aim to optimize the bilingual corpus to improve the performance of the translation system. We propose methods to process the bilingual language data by filtering noise and selecting more informative sentences from the training corpus and the development corpus. The experimental results show that we can obtain competitive performance using less data than when using all available data.
  • Keywords
    language translation; statistical analysis; bilingual language resource; corpus quality; development corpus; noise filtering; statistical machine translation; training corpus; Cybernetics; Data mining; Filtering theory; Machine learning; Noise; Training; Training data; Corpus optimization; Data selection; Noise filter; Statistical machine translation;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Machine Learning and Cybernetics (ICMLC), 2010 International Conference on
  • Conference_Location
    Qingdao
  • Print_ISBN
    978-1-4244-6526-2
  • Type
    conf
  • DOI
    10.1109/ICMLC.2010.5580699
  • Filename
    5580699