DocumentCode :
2194344
Title :
Semi-supervised PLSA for Document Clustering
Author :
Niu, Lingfeng ; Shi, Yong
Author_Institution :
Res. Center on Fictitious Econ. & Data Sci., Chinese Acad. of Sci., Beijing, China
fYear :
2010
fDate :
13 Dec. 2010
Firstpage :
1196
Lastpage :
1203
Abstract :
By utilizing must-link or cannot-link pairwise constraints in the data, semi-supervised clustering significantly improves on the performance of unsupervised clustering. A number of semi-supervised clustering algorithms have been proposed to take such pairwise constraints into account. However, most of them assign a hard label to each data item and produce little information about the clusters themselves. In this work, we propose a semi-supervised algorithm for document clustering based on Probabilistic Latent Semantic Analysis (PLSA), which employs must-link supervision between pairs of documents, a form of supervision available in many real-world data sets. The new algorithm produces a soft cluster label assignment for each document as well as a probabilistic representation of the latent topics in each cluster. No parameters need to be estimated beyond those of standard PLSA, which reduces the risk of over-fitting, especially when the data is sparse. We provide an Expectation Maximization (EM) procedure for semi-supervised PLSA that determines locally optimal parameters maximizing the likelihood. To utilize multiple computation nodes for large-scale data sets, we also propose a distributed implementation of the EM procedure based on the MapReduce framework. Experimental results on public data sets validate the effectiveness and efficiency of the new method.
Keywords :
distributed algorithms; document handling; expectation-maximisation algorithm; pattern clustering; probability; unsupervised learning; MapReduce; distributed implementation; document clustering; expectation maximization; pairwise constraints; probabilistic latent semantic analysis; semisupervised PLSA; soft cluster label assignment; Distributed Algorithm; PLSA; Semi-supervised Clustering; Topic Model;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Mining Workshops (ICDMW), 2010 IEEE International Conference on
Conference_Location :
Sydney, NSW
Print_ISBN :
978-1-4244-9244-2
Electronic_ISBN :
978-0-7695-4257-7
Type :
conf
DOI :
10.1109/ICDMW.2010.85
Filename :
5693430