Title :
Large vocabulary Uyghur continuous speech recognition based on stems and suffixes
Author :
Li, Xin ; Cai, Shang ; Pan, Jielin ; Yan, Yonghong ; Yang, Yafei
Author_Institution :
THINKIT Speech Lab., Chinese Acad. of Sci., Beijing, China
fDate :
Nov. 29 2010-Dec. 3 2010
Abstract :
In this paper, we study the vocabulary design problem in Uyghur large vocabulary continuous speech recognition (LVCSR). Uyghur is an agglutinative language in which words are formed by attaching several suffixes to a stem. As a result, the number of word types in Uyghur is unlimited. If the word is used as the recognition unit, the out-of-vocabulary (OOV) rate is very high at typical vocabulary sizes of 60k-100k. To avoid this problem, we split words into stems and suffixes and use these sub-words as the recognition units. Speech recognition experiments are performed on two test sets, one consisting of sentences from books and the other of sentences from conversations. Compared to the 80k-word baseline, the use of stems and suffixes dramatically alleviates the OOV problem, and the best system reduces the word error rate (WER) from 46.5% to 44.5% on the book sentences test set and from 57.6% to 47.5% on the conversation sentences test set.
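The effect described in the abstract can be illustrated with a minimal sketch. It is not the authors' segmentation method: the suffix list, the greedy suffix-stripping rule, and the toy word forms are all hypothetical and ignore real Uyghur morphology such as vowel harmony. The sketch only shows why a sub-word vocabulary can cover word forms that a word vocabulary has never seen, and how the relative WER reductions follow from the figures quoted above.

```python
from typing import Iterable, List, Set

# Hypothetical suffix inventory for illustration only (not a real Uyghur analyzer).
TOY_SUFFIXES: List[str] = ["lar", "da", "ni"]


def split_word(word: str, suffixes: List[str] = TOY_SUFFIXES) -> List[str]:
    """Greedily strip known suffixes from the end of a word.

    Returns [stem, "+suffix", ...]; suffix units are marked with "+"
    so they stay distinct from stems in the vocabulary.
    """
    found: List[str] = []
    stem = word
    changed = True
    while changed:
        changed = False
        for suf in suffixes:
            # Keep at least two characters of stem (an arbitrary toy constraint).
            if stem.endswith(suf) and len(stem) - len(suf) >= 2:
                found.insert(0, "+" + suf)
                stem = stem[: -len(suf)]
                changed = True
                break
    return [stem] + found


def oov_rate(tokens: Iterable[str], vocab: Set[str]) -> float:
    """Fraction of tokens that are not in the vocabulary."""
    token_list = list(tokens)
    if not token_list:
        return 0.0
    return sum(t not in vocab for t in token_list) / len(token_list)


def relative_reduction(before: float, after: float) -> float:
    """Relative reduction of an error rate, e.g. WER."""
    return (before - after) / before


# Made-up toy word forms: two stems combined with different suffixes.
train_words = ["kitab", "kitablar", "mektepda", "kitabni"]
test_words = ["mekteplar", "kitabda", "mektepni", "kitab"]

# Word-based vocabulary: every distinct training word is one unit.
word_vocab = set(train_words)
word_oov = oov_rate(test_words, word_vocab)

# Sub-word vocabulary: stems and suffixes seen in training.
subword_vocab = {unit for w in train_words for unit in split_word(w)}
test_units = [unit for w in test_words for unit in split_word(w)]
subword_oov = oov_rate(test_units, subword_vocab)

# Relative WER reductions computed from the figures quoted in the abstract.
book_gain = relative_reduction(46.5, 44.5)
conversation_gain = relative_reduction(57.6, 47.5)
```

On this toy data, three of the four test words are unseen as whole words, while every stem and suffix unit they split into was seen in training. With the WER figures from the abstract, the relative reductions come out to roughly 4.3% for the book test set and 17.5% for the conversation test set.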
Keywords :
natural language processing; speech recognition; text analysis; vocabulary; agglutinative language; continuous speech recognition; conversation sentences test; large vocabulary Uyghur; out-of-vocabulary rate; recognition unit; stems; suffixes; vocabulary design problem; word error rate; word types; Acoustics; Books; Databases; Hidden Markov models; Speech; Speech recognition; Vocabulary; Agglutinative language; Stems and suffixes based language model; Uyghur large vocabulary continuous speech recognition;
Conference_Title :
Chinese Spoken Language Processing (ISCSLP), 2010 7th International Symposium on
Conference_Location :
Tainan
Print_ISBN :
978-1-4244-6244-5
DOI :
10.1109/ISCSLP.2010.5684909