Abstract:
A vocabulary list and a language model are primary components of a speech translation system. Generating both from plain text is straightforward for English, but it is quite challenging for languages such as Chinese, Japanese, and Thai, whose texts contain no word boundary delimiters. For Thai word segmentation, maximal matching, a lexicon-based approach, is one of the most popular methods. Nevertheless, this method relies heavily on the coverage of the lexicon: when the text contains an unknown word, it usually produces wrong boundaries, and some words are then lost when a vocabulary is extracted from the segmented text. In this paper, we propose statistical techniques to tackle this problem. We build speech translation systems based on different word segmentation methods and show that the proposed method significantly improves translation accuracy, by about 6.42 BLEU points over the baseline system.
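To make the unknown-word failure mode concrete, the following is a minimal sketch of lexicon-based maximal matching (not the paper's implementation): a dynamic program that picks the segmentation with the fewest lexicon words, falling back to single characters for spans the lexicon does not cover. The toy English lexicon and the function name are illustrative assumptions; Thai text would be handled the same way, character by character.

```python
def maximal_matching(text, lexicon):
    """Segment `text` into the fewest lexicon words via dynamic programming.

    Spans not covered by any lexicon word fall back to single characters,
    mimicking how an unknown word forces wrong boundaries. Toy example only.
    """
    n = len(text)
    best = [None] * (n + 1)   # best[i] = fewest words covering text[:i]
    best[0] = 0
    back = [0] * (n + 1)      # back[i] = start of the last word in the best split
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is None:
                continue
            piece = text[j:i]
            # Accept a lexicon word, or a single-character fallback for unknowns.
            if piece in lexicon or i - j == 1:
                if best[i] is None or best[j] + 1 < best[i]:
                    best[i] = best[j] + 1
                    back[i] = j
    # Recover the segmentation by walking the backpointers.
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return list(reversed(words))

lexicon = {"in", "the", "then", "there", "after"}
print(maximal_matching("thereafter", lexicon))   # in-lexicon text segments cleanly
print(maximal_matching("thereafterx", lexicon))  # an unknown tail degrades to characters
```

The second call shows the problem the abstract describes: the unknown suffix is shattered into single characters, so no extractable word corresponds to it.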
Keywords:
feature extraction; language translation; natural language processing; speech recognition; statistical analysis; vocabulary; Thai speech translation; language model; lexicon-based approach; maximal matching; statistical techniques; text segmentation; vocabulary list; word extraction; word segmentation; automatic speech recognition; dictionaries; entropy; natural languages; statistical machine translation; text processing; training data; spoken language translation