DocumentCode :
2248250
Title :
Approaches to improving corpus quality for statistical machine translation
Author :
Liu, Peng ; Zhou, Yu ; Zong, Cheng-qing
Author_Institution :
Nat. Lab. of Pattern Recognition, Chinese Acad. of Sci., Beijing, China
Volume :
6
fYear :
2010
fDate :
11-14 July 2010
Firstpage :
3293
Lastpage :
3298
Abstract :
The performance of a statistical machine translation (SMT) system depends heavily on the quantity and quality of its bilingual language resources. However, previous work has mainly focused on quantity, trying to collect more bilingual data. In this paper, we aim to optimize the bilingual corpus to improve the performance of the translation system. We propose methods to process the bilingual language data by filtering noise and selecting more informative sentences from the training corpus and the development corpus. The experimental results show that we can obtain competitive performance using less data than when using all available data.
Keywords :
language translation; statistical analysis; bilingual language resource; corpus quality; development corpus; noise filtering; statistical machine translation; training corpus; Cybernetics; Data mining; Filtering theory; Machine learning; Noise; Training; Training data; Corpus optimization; Data selection; Noise filter; Statistical machine translation;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Machine Learning and Cybernetics (ICMLC), 2010 International Conference on
Conference_Location :
Qingdao
Print_ISBN :
978-1-4244-6526-2
Type :
conf
DOI :
10.1109/ICMLC.2010.5580699
Filename :
5580699