DocumentCode
3300217
Title
Automatic clustering of part-of-speech for vocabulary divided PLSA language model
Author
Suzuki, Motoyuki ; Kuriyama, Naoto ; Ito, Akinori ; Makino, Shozo
Author_Institution
Univ. of Tokushima, Tokushima
fYear
2008
fDate
19-22 Oct. 2008
Firstpage
1
Lastpage
7
Abstract
Probabilistic latent semantic analysis (PLSA) is one of the most powerful language models for adaptation to target speech. The vocabulary divided PLSA language model (VD-PLSA) outperforms the conventional PLSA model because it can be adapted to the target topic and the target speaking style independently. However, the entire vocabulary must be manually divided into three categories (topic, speaking style, and general). In this paper, an automatic method for clustering parts of speech (POS) is proposed for VD-PLSA. Several corpora with different styles are prepared, and the distance between corpora is calculated in terms of POS. A "general tendency score" and a "style tendency score" are calculated for each POS from the distances between corpora. All POS are then divided into the three categories using the two scores and appropriate thresholds. Experimental results showed that the proposed method formed appropriate clusters, and that VD-PLSA with the acquired categories gave the highest performance among the models compared. We also applied VD-PLSA to a large vocabulary continuous speech recognition system. VD-PLSA improved recognition accuracy for documents with a lower out-of-vocabulary ratio, while accuracy for the other documents was unchanged or slightly degraded.
Keywords
natural language processing; pattern clustering; speech processing; speech recognition; automatic clustering; documents; general tendency score; out-of-vocabulary ratio; part-of-speech; powerful language model; speaking style; style tendency score; target speaking; target speech; vocabulary continuous speech recognition system; vocabulary divided PLSA language model; Adaptation model; Clustering algorithms; Indium tin oxide; Natural languages; Predictive models; Probability; Statistics; Training data; Vocabulary; Vocabulary divided PLSA; general/style tendency score; language model
fLanguage
English
Publisher
IEEE
Conference_Titel
Natural Language Processing and Knowledge Engineering, 2008. NLP-KE '08. International Conference on
Conference_Location
Beijing
Print_ISBN
978-1-4244-4515-8
Electronic_ISBN
978-1-4244-2780-2
Type
conf
DOI
10.1109/NLPKE.2008.4906747
Filename
4906747