Title :
Semi-supervised learning of language model using unsupervised topic model
Author :
Bai, Shuanhu ; Huang, Chien-Lin ; Ma, Bin ; Li, Haizhou
Author_Institution :
Inst. for Infocomm Res., Singapore, Singapore
Abstract :
We present a semi-supervised learning (SSL) method for building domain-specific language models (LMs) from general-domain data using probabilistic latent semantic analysis (PLSA). The proposed technique first performs topic decomposition (TD) on the combined set of domain-specific and general-domain data. It then derives the latent topic distribution of the domain of interest and obtains domain-specific word n-gram counts with a PLSA-style mixture model. Finally, it applies traditional n-gram modeling to construct domain-specific LMs from these counts. Experimental results show that the technique outperforms both state-of-the-art relative-entropy-based text selection and traditional supervised training methods.
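The core idea in the abstract can be sketched as follows: each general-domain document contributes its n-gram counts in proportion to a topic-model weight for the target domain. This is a minimal illustrative sketch, not the authors' implementation; the per-document weights are assumed here to come from a PLSA decomposition (e.g., the posterior of the domain topic given the document), and the toy corpus and weight values are hypothetical.

```python
from collections import Counter

def domain_ngram_counts(docs, doc_topic_weights, n=2):
    """Accumulate fractional n-gram counts over a corpus, weighting each
    document's contribution by its (assumed) domain-topic weight.

    docs: list of token lists.
    doc_topic_weights: one weight per document, e.g. P(domain topic | doc)
    taken from a PLSA-style topic decomposition (hypothetical values here).
    """
    counts = Counter()
    for tokens, w in zip(docs, doc_topic_weights):
        # Slide a window of length n over the token sequence.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += w
    return counts

# Toy corpus with made-up domain-topic posteriors for illustration.
docs = [["stock", "market", "news"], ["stock", "market", "report"]]
weights = [0.9, 0.1]
counts = domain_ngram_counts(docs, weights, n=2)
# The weighted counts would then feed a standard n-gram LM estimator.
```

A bigram shared by both documents, such as ("stock", "market"), accumulates the sum of the two document weights, while bigrams unique to the low-weight document are correspondingly down-weighted.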
Keywords :
learning (artificial intelligence); natural language processing; statistical analysis; PLSA style mixture model; domain-specific language models; domain-specific word n-gram counts; language model learning; probabilistic latent semantic analysis; relative entropy text selection; semi-supervised learning; topic decomposition; unsupervised topic model; Bridges; Buildings; Computer science; Domain specific languages; Entropy; Joining processes; Learning systems; Semisupervised learning; Statistical distributions; Statistics; language model; semi-supervised learning; topic model;
Conference_Titel :
Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on
Conference_Location :
Dallas, TX
Print_ISBN :
978-1-4244-4295-9
ISSN :
1520-6149
DOI :
10.1109/ICASSP.2010.5494940