DocumentCode
1893267
Title
Inference for probabilistic unsupervised text clustering
Author
Rigouste, Loïs ; Cappé, Olivier ; Yvon, François
Author_Institution
Ecole Nationale Superieure des Telecommun., Paris
fYear
2005
fDate
17-20 July 2005
Firstpage
387
Lastpage
392
Abstract
In this article, we investigate the use of a simple probabilistic model for unsupervised document clustering in large collections of texts. The model consists of a mixture of multinomial distributions over the word counts, each component corresponding to a different theme. The expectation-maximization (EM) algorithm is the basic tool used for inference. After introducing the model and experimental framework (corpus and evaluation measures), we discuss the importance of initialization and illustrate the difficulty caused by the lack of supervision information. We propose some ideas to solve this problem, one of the most efficient methods being based on vocabulary reduction, and finally compare those heuristics with other inference processes, such as Gibbs sampling.
Keywords
expectation-maximisation algorithm; inference mechanisms; pattern clustering; probability; text analysis; unsupervised learning; vocabulary; expectation-maximization algorithm; inference process; multinomial distribution; probabilistic model; text collection; unsupervised document clustering; vocabulary reduction; Availability; Clustering algorithms; Electronic mail; Inference algorithms; Parameter estimation; Performance analysis; Sampling methods; Text mining; Vocabulary; Web pages
fLanguage
English
Publisher
IEEE
Conference_Titel
Statistical Signal Processing, 2005 IEEE/SP 13th Workshop on
Conference_Location
Novosibirsk
Print_ISBN
0-7803-9403-8
Type
conf
DOI
10.1109/SSP.2005.1628626
Filename
1628626