• DocumentCode
    1994164
  • Title

    Bilingual Segmenter for Statistical Machine Translation

  • Author

    Huang, Chung-Chi ; Chen, Wei-teh ; Chang, Jason S.

  • Author_Institution
    ISA, NTHU, Hsinchu, Taiwan
  • fYear
    2008
  • fDate
    15-16 Dec. 2008
  • Firstpage
    97
  • Lastpage
    104
  • Abstract
    We propose a bilingually-motivated segmenting framework for Chinese, which has no clear delimiter for word boundaries. It involves producing Chinese tokens in line with word-based languages' words using a bilingual segmenting algorithm, provided with bitexts, and deriving a probabilistic tokenizing model based on previously annotated Chinese sentences. In the bilingual segmenting algorithm, we first convert the search for segmentation into a sequential tagging problem, allowing for a polynomial-time dynamic programming solution, and incorporate a control to balance mono- and bi-lingual information in tailoring Chinese sentences. Experiments show that our framework, applied as a pre-tokenization component, significantly outperforms existing segmenters in translation quality, suggesting our methodology supports better segmentation for bilingual NLP applications involving isolated languages such as Chinese.
  • Keywords
    computational complexity; computational linguistics; dynamic programming; language translation; natural language processing; probability; Chinese bilingual segmenting algorithm; Chinese sentence; Chinese token; bitext; natural language processing; polynomial-time dynamic programming solution; probabilistic tokenizing model; sequential tagging problem; statistical machine translation; word-based language; Decoding; Dynamic programming; Instruction sets; Natural language processing; Natural languages; Performance analysis; Polynomials; Probability; Tagging; White spaces; bilingual segmenter; conditional random fields; machine translation; phrase-based decoder; word alignment;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Universal Communication, 2008. ISUC '08. Second International Symposium on
  • Conference_Location
    Osaka
  • Print_ISBN
    978-0-7695-3433-6
  • Type

    conf

  • DOI
    10.1109/ISUC.2008.10
  • Filename
    4724447