Title :
Inference for probabilistic unsupervised text clustering
Author :
Rigouste, Loïs ; Cappé, Olivier ; Yvon, François
Author_Institution :
Ecole Nationale Superieure des Telecommun., Paris
Abstract :
In this article, we investigate the use of a simple probabilistic model for unsupervised document clustering in large collections of texts. The model consists of a mixture of multinomial distributions over the word counts, each component corresponding to a different theme. The expectation-maximization (EM) algorithm is the basic tool used for inference. After introducing the model and experimental framework (corpus and evaluation measures), we discuss the importance of initialization and illustrate the difficulty caused by the lack of supervision information. We propose some ideas to solve this problem, one of the most efficient methods being based on vocabulary reduction, and finally compare those heuristics with other inference processes, such as Gibbs sampling.
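The model described in the abstract can be illustrated with a minimal sketch of EM for a mixture of multinomials over word counts. This is not the authors' implementation: the function name `em_multinomial_mixture`, the `smoothing` pseudocount, the random soft initialization, and the toy corpus are all assumptions made for illustration only.

```python
import math
import random


def em_multinomial_mixture(counts, n_clusters, n_iter=50, smoothing=0.1, seed=0):
    """Fit a mixture of multinomials to word-count vectors with EM.

    counts: list of documents, each a list of word counts over a fixed vocabulary.
    Returns (mixture weights, per-cluster word probabilities, responsibilities).
    The smoothing pseudocount is an assumption of this sketch.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    # Random soft assignment of documents to clusters (assumed initialization).
    resp = []
    for _ in range(n_docs):
        row = [rng.random() + 1e-3 for _ in range(n_clusters)]
        total = sum(row)
        resp.append([r / total for r in row])

    for _ in range(n_iter):
        # M-step: smoothed mixture weights and per-cluster word probabilities.
        weights = [
            (smoothing + sum(resp[d][k] for d in range(n_docs)))
            / (n_docs + n_clusters * smoothing)
            for k in range(n_clusters)
        ]
        word_probs = []
        for k in range(n_clusters):
            expected = [
                smoothing + sum(resp[d][k] * counts[d][w] for d in range(n_docs))
                for w in range(n_words)
            ]
            total = sum(expected)
            word_probs.append([e / total for e in expected])

        # E-step: posterior cluster probabilities, computed in log space.
        # The multinomial coefficient is the same for every cluster, so it is omitted.
        for d in range(n_docs):
            log_post = [
                math.log(weights[k])
                + sum(c * math.log(word_probs[k][w]) for w, c in enumerate(counts[d]))
                for k in range(n_clusters)
            ]
            top = max(log_post)
            unnorm = [math.exp(lp - top) for lp in log_post]
            total = sum(unnorm)
            resp[d] = [u / total for u in unnorm]

    return weights, word_probs, resp


# Toy corpus (hypothetical): three documents per theme, vocabulary of four words.
docs = [
    [5, 4, 0, 0], [6, 3, 0, 0], [4, 5, 0, 0],
    [0, 0, 5, 4], [0, 0, 3, 6], [0, 0, 4, 5],
]
weights, word_probs, resp = em_multinomial_mixture(docs, n_clusters=2)
labels = [max(range(2), key=lambda k: r[k]) for r in resp]
print(labels)
```

On real collections the result depends strongly on the starting point, which is the initialization issue the abstract raises; changing `seed` in this sketch is one simple way to observe that sensitivity.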
Keywords :
expectation-maximisation algorithm; inference mechanisms; pattern clustering; probability; text analysis; unsupervised learning; vocabulary; expectation-maximization algorithm; inference process; multinomial distribution; probabilistic model; text collection; unsupervised document clustering; vocabulary reduction; Availability; Clustering algorithms; Electronic mail; Inference algorithms; Parameter estimation; Performance analysis; Sampling methods; Text mining; Vocabulary; Web pages;
Conference_Title :
Statistical Signal Processing, 2005 IEEE/SP 13th Workshop on
Conference_Location :
Novosibirsk
Print_ISBN :
0-7803-9403-8
DOI :
10.1109/SSP.2005.1628626