DocumentCode :
2650706
Title :
Improved Graph-Based Bilingual Corpus Selection with Sentence Pair Ranking for Statistical Machine Translation
Author :
Chao, Wenhan ; Li, Zhoujun
Author_Institution :
Sch. of Comput. Sci. & Eng., Beihang Univ., Beijing, China
fYear :
2011
fDate :
7-9 Nov. 2011
Firstpage :
446
Lastpage :
451
Abstract :
In statistical machine translation, the number of sentence pairs in the bilingual corpus strongly affects translation quality. However, once the corpus reaches a certain size, enlarging it further has less effect on translation quality while greatly increasing the time and space complexity of building translation systems, which hinders the development of statistical machine translation. In this paper, we propose several ranking approaches to measure the quantity of information in each sentence pair, and apply them to a graph-based bilingual corpus selection framework to form an improved corpus selection approach that accounts for differences in the initial quantity of information between sentence pairs. Our experiments on a Chinese-English translation task show that, by selecting only 50% of the whole corpus as the training set via the graph-based selection approach, we obtain translation results close to those obtained with the whole corpus, and that we obtain better results than the baselines when using the IDF-related ranking approach.
Keywords :
computational complexity; graph theory; language translation; natural language processing; statistical analysis; Chinese-English translation task; IDF-related ranking; building translation system; graph-based bilingual corpus selection; information quantity measure; sentence pair ranking; space complexity; statistical machine translation quality; time complexity; Educational institutions; Equations; Mathematical model; Measurement; Redundancy; Training; Vocabulary; Corpus Selection; Graph; Ranking; SMT;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Tools with Artificial Intelligence (ICTAI), 2011 23rd IEEE International Conference on
Conference_Location :
Boca Raton, FL
ISSN :
1082-3409
Print_ISBN :
978-1-4577-2068-0
Electronic_ISBN :
1082-3409
Type :
conf
DOI :
10.1109/ICTAI.2011.73
Filename :
6103363