Title :
Improved Graph-Based Bilingual Corpus Selection with Sentence Pair Ranking for Statistical Machine Translation
Author :
Chao, Wenhan ; Li, Zhoujun
Author_Institution :
Sch. of Comput. Sci. & Eng., Beihang Univ., Beijing, China
Abstract :
In statistical machine translation, the number of sentence pairs in the bilingual corpus strongly affects translation quality. However, once the corpus reaches a certain size, enlarging it further yields diminishing gains in translation quality while greatly increasing the time and space complexity of building translation systems, which hinders the development of statistical machine translation. In this paper, we propose several ranking approaches that measure the quantity of information in each sentence pair, and apply them within a graph-based bilingual corpus selection framework to form an improved corpus selection approach that takes into account the differences in the initial quantities of information among sentence pairs. Our experiments on a Chinese-English translation task show that, using only 50% of the whole corpus selected via the graph-based approach as the training set, we obtain translation results close to those obtained with the whole corpus, and that using the IDF-related ranking approach yields better results than the baselines.
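The abstract does not give the ranking formulas, so the following is only a minimal sketch of what an IDF-related sentence pair ranking could look like. It assumes whitespace-tokenized text (Chinese would need word segmentation first), scores a pair by the summed IDF of its distinct source and target words, and keeps the top-scoring fraction. The names `idf_table`, `rank_pairs`, and `select_top` are hypothetical, and the graph-based selection step is not modeled.

```python
import math
from collections import Counter


def idf_table(sentences):
    """Map each word to log(N / df), where df is the number of sentences containing it."""
    n = len(sentences)
    df = Counter()
    for sent in sentences:
        df.update(set(sent.split()))  # count each word once per sentence
    return {word: math.log(n / count) for word, count in df.items()}


def rank_pairs(pairs):
    """Return pair indices sorted by descending score.

    Assumed score (not taken from the paper): sum of IDF over the distinct
    source words plus sum of IDF over the distinct target words.
    """
    src_idf = idf_table([src for src, _ in pairs])
    tgt_idf = idf_table([tgt for _, tgt in pairs])

    def score(pair):
        src, tgt = pair
        return (sum(src_idf[w] for w in set(src.split()))
                + sum(tgt_idf[w] for w in set(tgt.split())))

    return sorted(range(len(pairs)), key=lambda i: score(pairs[i]), reverse=True)


def select_top(pairs, fraction=0.5):
    """Keep the highest-ranked fraction of the sentence pairs (at least one)."""
    keep = max(1, int(len(pairs) * fraction))
    return [pairs[i] for i in rank_pairs(pairs)[:keep]]


if __name__ == "__main__":
    toy = [("a b", "x y"), ("a b", "x y"), ("a c d", "x z w"), ("a", "x")]
    print(select_top(toy, fraction=0.5))
```

Under this assumed score, pairs made of words that occur in every sentence contribute nothing, while pairs with rare words rank first, which matches the intuition of favoring pairs that carry more information.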
Keywords :
computational complexity; graph theory; language translation; natural language processing; statistical analysis; Chinese-English translation task; IDF-related ranking; building translation system; graph-based bilingual corpus selection; information quantity measure; sentence pair ranking; space complexity; statistical machine translation quality; time complexity; Educational institutions; Equations; Mathematical model; Measurement; Redundancy; Training; Vocabulary; Corpus Selection; Graph; Ranking; SMT;
Conference_Title :
Tools with Artificial Intelligence (ICTAI), 2011 23rd IEEE International Conference on
Conference_Location :
Boca Raton, FL
Print_ISBN :
978-1-4577-2068-0
Electronic_ISBN :
1082-3409
DOI :
10.1109/ICTAI.2011.73