DocumentCode
1994164
Title
Bilingual Segmenter for Statistical Machine Translation
Author
Huang, Chung-Chi ; Chen, Wei-teh ; Chang, Jason S.
Author_Institution
ISA, NTHU, Hsinchu, Taiwan
fYear
2008
fDate
15-16 Dec. 2008
Firstpage
97
Lastpage
104
Abstract
We propose a bilingually-motivated segmenting framework for Chinese, which has no clear delimiter for word boundaries. It involves producing Chinese tokens in line with word-based languages' words using a bilingual segmenting algorithm, provided with bitexts, and deriving a probabilistic tokenizing model based on previously annotated Chinese sentences. In the bilingual segmenting algorithm, we first convert the search for segmentation into a sequential tagging problem, allowing for a polynomial-time dynamic programming solution, and incorporate a control to balance mono- and bi-lingual information in tailoring Chinese sentences. Experiments show that our framework, applied as a pre-tokenization component, significantly outperforms existing segmenters in translation quality, suggesting our methodology supports better segmentation for bilingual NLP applications involving isolated languages such as Chinese.
Keywords
computational complexity; computational linguistics; dynamic programming; language translation; natural language processing; probability; Chinese bilingual segmenting algorithm; Chinese sentence; Chinese token; bitext; natural language processing; polynomial-time dynamic programming solution; probabilistic tokenizing model; sequential tagging problem; statistical machine translation; word-based language; Decoding; Dynamic programming; Instruction sets; Natural language processing; Natural languages; Performance analysis; Polynomials; Probability; Tagging; White spaces; bilingual segmenter; conditional random fields; machine translation; phrase-based decoder; word alignment;
fLanguage
English
Publisher
IEEE
Conference_Titel
Universal Communication, 2008. ISUC '08. Second International Symposium on
Conference_Location
Osaka
Print_ISBN
978-0-7695-3433-6
Type
conf
DOI
10.1109/ISUC.2008.10
Filename
4724447