• DocumentCode
    1720767
  • Title

    An overview of the challenges and progress in PeEn-SMT: First large scale Persian-English SMT system

  • Author

    Mohaghegh, Mahsa ; Sarrafzadeh, Abdolhossein

  • Author_Institution
    Sch. of Eng. & Adv. Technol., Massey Univ., Auckland, New Zealand
  • fYear
    2011
  • Firstpage
    319
  • Lastpage
    323
  • Abstract
    This paper documents recent work carried out for PeEn-SMT, our Statistical Machine Translation system for translation between the English-Persian language pair. We give details of our previous SMT system, and present our current development of significantly larger corpora. We explain how recent tests using much larger corpora helped to evaluate problems in parallel corpus alignment and corpus content, and how matching the domains of PeEn-SMT's components affects translation outcome. We then focus on combining corpora and approaches to improve test data, showing details of experimental setup, together with a number of experimental results and comparisons between them. We show how one combination of corpora gave us a metric score outperforming Google Translate for the English-to-Persian translation. Finally, we outline areas of our intended future work, and how we plan to improve the performance of our system to achieve higher metric scores, and ultimately to provide accurate, reliable language translation.
  • Keywords
    Internet; language translation; natural language processing; English-to-Persian translation; Google Translate; PeEn-SMT; corpus content; first large scale Persian-English SMT system; parallel corpus alignment; statistical machine translation system; Data models; Google; Laboratories; Mathematical model; Measurement; NIST; Probability; Language Model; Statistical Machine translation- English-Persian;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Innovations in Information Technology (IIT), 2011 International Conference on
  • Conference_Location
    Abu Dhabi
  • Print_ISBN
    978-1-4577-0311-9
  • Type

    conf

  • DOI
    10.1109/INNOVATIONS.2011.5893841
  • Filename
    5893841