Title :
Segmenting Long Sentence Pairs for Statistical Machine Translation
Author :
Meng, Biping ; Huang, Shujian ; Dai, Xinyu ; Chen, Jiajun
Author_Institution :
Dept. of Comput. Sci. & Technol., Nanjing Univ., Nanjing, China
Abstract :
In phrase-based statistical machine translation, knowledge about phrase translation and phrase reordering is learned from bilingual corpora. In practice, however, words in long sentence pairs are often poorly aligned, which harms subsequent steps of translation such as phrase extraction. One possible solution is to segment long sentence pairs into shorter ones. In this paper, we present an effective approach to segmenting sentences based on a modified IBM translation model 1. We find that taking into account the semantics of some words, as well as the length ratio of the source and target sentences, greatly improves the segmentation result. We also discuss the effect of the length factor on the segmentation result. Experiments show that our approach can improve the BLEU score of a phrase-based translation system by about 0.5 points.
Keywords :
language translation; IBM translation model; bilingual corpora; long sentence pairs segmentation; phrase extraction; phrase reordering; phrase translation; phrase-based statistical machine translation; Computer science; Costs; Particle separators; Poisson distributed length ratio; length normalization; long sentences; segmentation; semantics guided
Conference_Title :
Asian Language Processing, 2009. IALP '09. International Conference on
Conference_Location :
Singapore
Print_ISBN :
978-0-7695-3904-1
DOI :
10.1109/IALP.2009.20