Domain adaptation for statistical machine translation in development corpus selection

Author

Zheng, Zhongguang ; He, Zhongjun ; Meng, Yao ; Yu, Hao

Author_Institution

Fujitsu R&D Center Co., Ltd., Taiwan

fYear

2010

Firstpage

2

Lastpage

7

Abstract

The performance of a statistical machine translation (SMT) system is affected by its model parameters (e.g. the weights of feature functions), which are usually tuned on a development corpus. Most research to date has focused on algorithms for tuning these parameters, while the selection of the development corpus itself has received little discussion. It is believed that parameters tuned on a suitable corpus will improve translation performance. Instead of exploring new tuning algorithms, this paper aims to select the development corpus for parameter tuning according to the test set. We treat this problem as domain adaptation and propose two methods, one based on information retrieval (IR) and one based on text clustering (TC). Experimental results show that both methods yield more stable tuning performance than subjective selection of the development corpus.
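
The abstract does not spell out how the IR-based selection works, so the following is only a minimal sketch of the general idea under stated assumptions: candidate development sentences are ranked by TF-IDF cosine similarity to the pooled test set, and the top k are kept. The function names (tfidf_vectors, cosine, select_dev_sentences), the whitespace tokenization, and the smoothed IDF formula are all hypothetical choices made for illustration, not the authors' method.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumption; other weightings would work too).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_dev_sentences(candidates, test_sentences, k):
    """Pick the k candidate source sentences most similar to the test set.

    Hypothetical helper: the whole test set is pooled into a single query.
    """
    cand_tokens = [s.lower().split() for s in candidates]
    query_tokens = [tok for s in test_sentences for tok in s.lower().split()]
    vectors = tfidf_vectors(cand_tokens + [query_tokens])
    query = vectors[-1]
    ranked = sorted(
        range(len(candidates)),
        key=lambda i: cosine(vectors[i], query),
        reverse=True,
    )
    return [candidates[i] for i in ranked[:k]]
```

In an actual SMT setup the selected source sentences would be paired with their reference translations to form the development corpus used for tuning; that step is omitted here.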

Keywords

information retrieval; language translation; statistical analysis; IR; SMT; TC; corpus selection development; domain adaptation; feature functions; model parameters; statistical machine translation; text clustering; tuning parameters; Adaptation model; Clustering methods; Feature extraction; Information retrieval; NIST; Training; Tuning

fLanguage

English

Publisher

IEEE

Conference_Titel

Universal Communication Symposium (IUCS), 2010 4th International

Conference_Location

Beijing

Print_ISBN

978-1-4244-7821-7

Type

conf

DOI

10.1109/IUCS.2010.5666775

Filename

5666775