DocumentCode :
2660165
Title :
Improving word segmentation for Thai speech translation
Author :
Charoenpornsawat, Paisarn ; Schultz, Tanja
fYear :
2008
fDate :
15-19 Dec. 2008
Firstpage :
241
Lastpage :
244
Abstract :
A vocabulary list and a language model are primary components of a speech translation system. Generating both from plain text is straightforward for English. However, it is quite challenging for Chinese, Japanese, or Thai, which are written without word boundary delimiters. For Thai word segmentation, maximal matching, a lexicon-based approach, is one of the popular methods. Nevertheless, this method relies heavily on the coverage of the lexicon. When the text contains an unknown word, the method usually produces a wrong boundary, and when words are then extracted from the segmented text, some words are not retrieved because of the wrong segmentation. In this paper, we propose statistical techniques to tackle this problem. Based on different word segmentation methods, we develop various speech translation systems and show that the proposed method can significantly improve the translation accuracy by about 6.42% BLEU points compared to the baseline system.
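To make the lexicon-coverage problem described in the abstract concrete, here is a minimal sketch of one common formulation of maximal matching: among all ways to split the text, choose the one with the fewest tokens. It is not the authors' implementation. The function name `maximal_matching`, the constant `UNKNOWN_PENALTY`, the single-character fallback for out-of-lexicon text, and the toy Latin-script lexicon (standing in for unsegmented Thai) are all assumptions made for illustration.

```python
# Hypothetical sketch: fewest-token segmentation over a lexicon.
# Characters not covered by any lexicon word become single-character
# tokens at a higher cost (an assumption of this sketch).
UNKNOWN_PENALTY = 2


def maximal_matching(text, lexicon):
    """Return the segmentation of `text` with the lowest total cost."""
    n = len(text)
    max_len = max((len(w) for w in lexicon), default=1)
    cost = [0] * (n + 1)   # cost[i]: best cost of segmenting text[:i]
    back = [0] * (n + 1)   # back[i]: start index of the last token in text[:i]
    for i in range(1, n + 1):
        # Fallback: treat text[i-1] as an unknown single-character token.
        cost[i] = cost[i - 1] + UNKNOWN_PENALTY
        back[i] = i - 1
        # Try every lexicon word that ends at position i.
        for j in range(max(0, i - max_len), i):
            if text[j:i] in lexicon and cost[j] + 1 < cost[i]:
                cost[i] = cost[j] + 1
                back[i] = j
    # Recover the tokens by following the back-pointers from the end.
    tokens = []
    i = n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]


# Full lexicon coverage: boundaries are recovered.
print(maximal_matching("speechtranslationsystem",
                       {"speech", "translation", "system"}))
# → ['speech', 'translation', 'system']

# "translation" is missing from the lexicon: the segmenter falls back to
# shorter entries and produces wrong boundaries inside the unknown word.
print(maximal_matching("speechtranslation",
                       {"speech", "ran", "slat", "ion"}))
# → ['speech', 't', 'ran', 'slat', 'ion']
```

In the second call the unknown word never appears as a token, so a vocabulary list extracted from the segmented text would miss it. This is the failure mode the abstract attributes to lexicon-based segmentation.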
Keywords :
feature extraction; language translation; natural language processing; speech recognition; statistical analysis; vocabulary; Thai speech translation; language model; lexicon-based approach; maximal matching; statistical techniques; text segmentation; vocabulary list; word extraction; word segmentation; Automatic speech recognition; Dictionaries; Entropy; Natural languages; Surface-mount technology; Text processing; Training data; Spoken language translation
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Spoken Language Technology Workshop, 2008. SLT 2008. IEEE
Conference_Location :
Goa
Print_ISBN :
978-1-4244-3471-8
Electronic_ISBN :
978-1-4244-3472-5
Type :
conf
DOI :
10.1109/SLT.2008.4777885
Filename :
4777885