Approaches to improving corpus quality for statistical machine translation

Author

Liu, Peng ; Zhou, Yu ; Zong, Cheng-qing

Author_Institution

Nat. Lab. of Pattern Recognition, Chinese Acad. of Sci., Beijing, China

Volume

6

fYear

2010

fDate

11-14 July 2010

Firstpage

3293

Lastpage

3298

Abstract

The performance of a statistical machine translation (SMT) system depends heavily on the quantity and quality of its bilingual language resources. However, previous work has mainly focused on quantity, trying to collect more bilingual data. In this paper, we aim to optimize the bilingual corpus to improve the performance of the translation system. We propose methods to process the bilingual language data by filtering noise and selecting more informative sentences from the training corpus and the development corpus. The experimental results show that we can obtain competitive performance using less data than when using all available data.

Keywords

language translation; statistical analysis; bilingual language resource; corpus quality; development corpus; noise filtering; statistical machine translation; training corpus; Cybernetics; Data mining; Filtering theory; Machine learning; Noise; Training; Training data; Corpus optimization; Data selection; Noise filter

fLanguage

English

Publisher

IEEE

Conference_Titel

Machine Learning and Cybernetics (ICMLC), 2010 International Conference on

Conference_Location

Qingdao

Print_ISBN

978-1-4244-6526-2

Type

conf

DOI

10.1109/ICMLC.2010.5580699

Filename

5580699