Large vocabulary Uyghur continuous speech recognition based on stems and suffixes

Author

Li, Xin ; Cai, Shang ; Pan, Jielin ; Yan, Yonghong ; Yang, Yafei

Author_Institution

THINKIT Speech Lab., Chinese Acad. of Sci., Beijing, China

fYear

2010

fDate

Nov. 29 2010-Dec. 3 2010

Firstpage

220

Lastpage

223

Abstract

In this paper, we study the vocabulary design problem in Uyghur large vocabulary continuous speech recognition (LVCSR). Uyghur is an agglutinative language in which words are formed by attaching several suffixes to a stem. As a result, the number of word types in Uyghur is effectively unbounded. If the word is used as the recognition unit, the out-of-vocabulary (OOV) rate is very high at typical vocabulary sizes of 60k to 100k. To avoid this problem, we split words into stems and suffixes and use these sub-words as the recognition units. Speech recognition experiments are performed on two test sets, one consisting of sentences from books and the other of sentences from conversations. Compared to the 80k-word baseline, the use of stems and suffixes substantially reduces the OOV rate, and the best system reduces the word error rate (WER) from 46.5% to 44.5% on the book sentences test set and from 57.6% to 47.5% on the conversation sentences test set.
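The effect described in the abstract can be illustrated with a small sketch. The abstract does not say how words are segmented, so this sketch assumes a greedy longest-suffix stripping rule purely for illustration. The function names (`split_word`, `oov_rate`), the suffix list, and the word forms are all hypothetical: they are invented toy forms loosely modeled on romanized agglutinative words, not a real Uyghur morphological analysis. The sketch only shows why sub-word units can cover word forms that a word-level vocabulary has never seen.

```python
def split_word(word, suffixes, min_stem_len=2):
    """Greedily strip known suffixes from the end of a word.

    Returns [stem, "+suffix1", "+suffix2", ...] in surface order.
    This is an assumed toy rule, not the segmentation used in the paper.
    """
    ordered = sorted(suffixes, key=len, reverse=True)  # try longer suffixes first
    stem, parts = word, []
    stripped = True
    while stripped:
        stripped = False
        for suf in ordered:
            if stem.endswith(suf) and len(stem) - len(suf) >= min_stem_len:
                parts.insert(0, "+" + suf)  # keep suffixes in surface order
                stem = stem[: -len(suf)]
                stripped = True
                break
    return [stem] + parts


def oov_rate(tokens, vocab):
    """Fraction of tokens that are not in the vocabulary."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t not in vocab) / len(tokens)


# Hypothetical toy data (assumption, not taken from the paper).
SUFFIXES = ["lar", "ler", "da", "de"]
train_words = ["kitab", "kitablar", "kitabda", "mektep", "mektepler", "mektepde"]
test_words = ["kitablarda", "mekteplerde", "kitab"]

# Word-level vocabulary: every training word form is one unit.
word_vocab = set(train_words)

# Sub-word vocabulary: stems and suffixes observed in the training words.
sub_vocab = {u for w in train_words for u in split_word(w, SUFFIXES)}

# Test data expressed in sub-word units.
test_units = [u for w in test_words for u in split_word(w, SUFFIXES)]

print("word-level OOV rate:", oov_rate(test_words, word_vocab))
print("sub-word OOV rate:", oov_rate(test_units, sub_vocab))
```

In this toy setup the two longer test forms never occur as whole words in the training list, so they are OOV at the word level, while each of their stems and suffixes has been seen separately. The two rates are computed over different units (words versus sub-words), so they indicate coverage rather than a direct like-for-like comparison.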

Keywords

natural language processing; speech recognition; text analysis; vocabulary; agglutinative language; continuous speech recognition; conversation sentences test; large vocabulary Uyghur; out-of-vocabulary rate; recognition unit; stems; suffixes; vocabulary design problem; word error rate; word types; acoustics; books; databases; hidden Markov models; speech; stems and suffixes based language model; Uyghur large vocabulary continuous speech recognition

fLanguage

English

Publisher

IEEE

Conference_Titel

Chinese Spoken Language Processing (ISCSLP), 2010 7th International Symposium on

Conference_Location

Tainan

Print_ISBN

978-1-4244-6244-5

Type

conf

DOI

10.1109/ISCSLP.2010.5684909

Filename

5684909