Title :
Semi-supervised Learning of Domain-Specific Language Models from General Domain Data
Author :
Bai, Shuanhu ; Zhang, Min ; Li, Haizhou
Author_Institution :
Inst. for Infocomm Res., Singapore, Singapore
Abstract :
We present a semi-supervised learning method for building domain-specific language models (LMs) from general-domain data. The method uses a small amount of domain-specific data as seeds to tap domain-specific resources residing in a larger amount of general-domain data, with the help of topic modeling technologies. The proposed algorithm first performs topic decomposition (TD) on the combined dataset of domain-specific and general-domain data using probabilistic latent semantic analysis (PLSA). It then derives domain-specific word n-gram counts using the mixture modeling scheme of PLSA. Finally, it uses the traditional n-gram modeling approach to construct domain-specific LMs from these domain-specific word n-gram counts. Experimental results show that this approach outperforms both state-of-the-art methods and a simulated supervised learning method on our data sets. In particular, the semi-supervised learning method achieves better performance even with a very small amount of domain-specific data.
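The pipeline described above can be sketched in miniature. The following is a minimal, self-contained illustration (not the authors' implementation): a toy corpus mixing a few domain-specific "seed" documents with general-domain documents, a pure-Python PLSA trained by EM with two latent topics, and domain-weighted unigram counts obtained by scaling each document's counts with its posterior weight on the seed-dominated topic. All document contents, variable names, and the choice of unigrams (rather than higher-order n-grams) are illustrative assumptions; a real system would feed the weighted counts into a standard n-gram LM toolkit.

```python
import random
from collections import Counter

# Toy corpus: the first two documents are domain-specific seeds
# (hypothetical medical domain); the rest are general-domain text.
docs = [
    "patient doctor hospital treatment doctor patient".split(),
    "hospital patient diagnosis treatment medicine".split(),
    "market stock trade price market economy".split(),
    "football match goal team player match".split(),
    "stock price economy trade market finance".split(),
]
seed_ids = {0, 1}          # indices of domain-specific seed documents
K = 2                      # number of latent PLSA topics
vocab = sorted({w for d in docs for w in d})
counts = [Counter(d) for d in docs]

random.seed(0)
# Random initialisation of P(w|z) and P(z|d), normalised to distributions.
p_w_z = [{w: random.random() for w in vocab} for _ in range(K)]
for z in range(K):
    s = sum(p_w_z[z].values())
    p_w_z[z] = {w: v / s for w, v in p_w_z[z].items()}
p_z_d = []
for _ in docs:
    row = [random.random() for _ in range(K)]
    s = sum(row)
    p_z_d.append([v / s for v in row])

for _ in range(50):  # EM iterations for PLSA
    acc_w_z = [dict.fromkeys(vocab, 1e-12) for _ in range(K)]
    acc_z_d = [[1e-12] * K for _ in docs]
    for d, cnt in enumerate(counts):
        for w, n in cnt.items():
            # E-step: posterior P(z | d, w) ∝ P(z|d) * P(w|z)
            post = [p_z_d[d][z] * p_w_z[z][w] for z in range(K)]
            s = sum(post)
            for z in range(K):
                r = n * post[z] / s
                acc_w_z[z][w] += r
                acc_z_d[d][z] += r
    # M-step: renormalise the expected counts.
    for z in range(K):
        s = sum(acc_w_z[z].values())
        p_w_z[z] = {w: v / s for w, v in acc_w_z[z].items()}
    for d in range(len(docs)):
        s = sum(acc_z_d[d])
        p_z_d[d] = [v / s for v in acc_z_d[d]]

# Identify the "domain" topic as the one the seed documents load on most.
domain_z = max(range(K), key=lambda z: sum(p_z_d[d][z] for d in seed_ids))

# Weight each document's word counts by P(domain topic | d) to obtain
# soft domain-specific unigram counts; the same weighting scheme would
# apply to bigram/trigram counts before standard n-gram LM estimation.
domain_counts = Counter()
for d, cnt in enumerate(counts):
    for w, n in cnt.items():
        domain_counts[w] += n * p_z_d[d][domain_z]
```

The key idea this sketch captures is that general-domain documents contribute to the domain-specific counts in proportion to their posterior topic weight, so domain-relevant material hidden in the general corpus is recovered softly rather than by hard document selection.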
Keywords :
learning (artificial intelligence); natural language processing; domain-specific language models; domain-specific word n-gram counts; general-domain data; probabilistic latent semantic analysis; semi-supervised learning; simulated supervised learning method; topic decomposition; traditional n-gram modeling approach; Algorithm design and analysis; Domain specific languages; Entropy; Information filtering; Information filters; Performance analysis; Search engines; Semisupervised learning; Supervised learning; Text categorization; language model; semi-supervised learning; topic model
Conference_Titel :
2009 International Conference on Asian Language Processing (IALP '09)
Conference_Location :
Singapore
Print_ISBN :
978-0-7695-3904-1
DOI :
10.1109/IALP.2009.65