DocumentCode :
600224
Title :
Source-Side Suffix Stripping for Bengali-to-English SMT
Author :
Haque, Rakibul ; Penkale, Sergio ; Jie Jiang ; Way, Andy
Author_Institution :
Appl. Language Solutions, Delph, Oldham, UK
fYear :
2012
fDate :
13-15 Nov. 2012
Firstpage :
193
Lastpage :
196
Abstract :
Data sparseness is a well-known problem for statistical machine translation (SMT) when morphologically rich and highly inflected languages are involved. This problem becomes worse in resource-scarce scenarios where sufficient parallel corpora are not available for model training. Recent research has shown that morphological segmentation can be employed on either side of the translation pair to reduce data sparsity. In this work, we consider a highly inflected Indian language, Bengali, as the source side of the translation pair. This paper presents a study of morphological segmentation in SMT with a less explored translation pair, Bengali-to-English. We work with a tiny training set available for this language pair. We employ a simple suffix-stripping method for lemmatizing inflected Bengali words. We show that our morphological suffix separation process significantly reduces data sparseness. We also show that an SMT model trained on suffix-stripped (source) training data significantly outperforms the state-of-the-art phrase-based SMT (PB-SMT) baseline.
Keywords :
language translation; natural language processing; statistical analysis; Bengali words; Bengali-to-English SMT model; Indian language; PB-SMT; data sparseness; data sparsity reduction; morphological segmentation; morphological suffix separation process; phrase-based SMT baseline; resource-scarce scenarios; source-side suffix-stripping method; statistical machine translation; translation pair; Accuracy; Computational linguistics; Morphology; Separation processes; Surface morphology; Training; Vocabulary
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Asian Language Processing (IALP), 2012 International Conference on
Conference_Location :
Hanoi
Print_ISBN :
978-1-4673-6113-2
Electronic_ISBN :
978-0-7695-4886-9
Type :
conf
DOI :
10.1109/IALP.2012.61
Filename :
6473729