DocumentCode
600224
Title
Source-Side Suffix Stripping for Bengali-to-English SMT
Author
Haque, Rakibul; Penkale, Sergio; Jiang, Jie; Way, Andy
Author_Institution
Appl. Language Solutions, Delph, Oldham, UK
fYear
2012
fDate
13-15 Nov. 2012
Firstpage
193
Lastpage
196
Abstract
Data sparseness is a well-known problem for statistical machine translation (SMT) when morphologically rich and highly inflected languages are involved. The problem becomes worse in resource-scarce scenarios where sufficient parallel corpora are not available for model training. Recent research has shown that morphological segmentation can be applied to either side of the translation pair to reduce data sparsity. In this work, we take Bengali, a highly inflected Indian language, as the source side of the translation pair. This paper presents a study of morphological segmentation in SMT for a less-explored translation pair, Bengali-to-English. We worked with a tiny training set available for this language pair. We employ a simple suffix-stripping method to lemmatize inflected Bengali words. We show that our morphological suffix separation process significantly reduces data sparseness. We also show that an SMT model trained on suffix-stripped source-side training data significantly outperforms the state-of-the-art phrase-based SMT (PB-SMT) baseline.
Keywords
language translation; natural language processing; statistical analysis; Bengali words; Bengali-to-English SMT model; Indian language; PB-SMT; data sparseness; data sparsity reduction; morphological segmentation; morphological suffix separation process; phrase-based SMT baseline; resource-scarce scenarios; source-side suffix-stripping method; statistical machine translation; translation pair; Accuracy; Computational linguistics; Morphology; Separation processes; Surface morphology; Training; Vocabulary
fLanguage
English
Publisher
IEEE
Conference_Titel
Asian Language Processing (IALP), 2012 International Conference on
Conference_Location
Hanoi
Print_ISBN
978-1-4673-6113-2
Electronic_ISBN
978-0-7695-4886-9
Type
conf
DOI
10.1109/IALP.2012.61
Filename
6473729