• DocumentCode
    600224
  • Title

    Source-Side Suffix Stripping for Bengali-to-English SMT

  • Author

    Haque, Rakibul; Penkale, Sergio; Jiang, Jie; Way, Andy

  • Author_Institution
    Applied Language Solutions, Delph, Oldham, UK
  • fYear
    2012
  • fDate
    13-15 Nov. 2012
  • Firstpage
    193
  • Lastpage
    196
  • Abstract
    Data sparseness is a well-known problem for statistical machine translation (SMT) when morphologically rich and highly inflected languages are involved. This problem becomes worse in resource-scarce scenarios where sufficient parallel corpora are not available for model training. Recent research has shown that morphological segmentation can be employed on either side of the translation pair to reduce data sparsity. In this work, we consider a highly inflected Indian language, Bengali, as the source side of the translation pair. This paper presents a study of morphological segmentation in SMT with a less-explored translation pair, Bengali-to-English. We worked with a tiny training set available for this language pair. We employ a simple suffix-stripping method for lemmatizing inflected Bengali words. We show that our morphological suffix separation process significantly reduces data sparseness. We also show that an SMT model trained on suffix-stripped (source) training data significantly outperforms the state-of-the-art phrase-based SMT (PB-SMT) baseline.
  • Keywords
    language translation; natural language processing; statistical analysis; Bengali words; Bengali-to-English SMT model; Indian language; PB-SMT; data sparseness; data sparsity reduction; morphological segmentation; morphological suffix separation process; phrase-based SMT baseline; resource-scarce scenarios; source-side suffix-stripping method; statistical machine translation; translation pair; Accuracy; Computational linguistics; Morphology; Separation processes; Surface morphology; Training; Vocabulary
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Asian Language Processing (IALP), 2012 International Conference on
  • Conference_Location
    Hanoi
  • Print_ISBN
    978-1-4673-6113-2
  • Electronic_ISBN
    978-0-7695-4886-9
  • Type

    conf

  • DOI
    10.1109/IALP.2012.61
  • Filename
    6473729