• DocumentCode
    1893267
  • Title

    Inference for probabilistic unsupervised text clustering

  • Author

    Rigouste, Loïs ; Cappé, Olivier ; Yvon, François

  • Author_Institution
    École Nationale Supérieure des Télécommunications, Paris
  • fYear
    2005
  • fDate
    17-20 July 2005
  • Firstpage
    387
  • Lastpage
    392
  • Abstract
    In this article, we investigate the use of a simple probabilistic model for unsupervised document clustering in large collections of texts. The model consists of a mixture of multinomial distributions over the word counts, each component corresponding to a different theme. The expectation-maximization (EM) algorithm is the basic tool used for inference. After introducing the model and experimental framework (corpus and evaluation measures), we discuss the importance of initialization and illustrate the difficulty caused by the lack of supervision information. We propose some ideas to solve this problem, one of the most efficient methods being based on vocabulary reduction, and finally compare those heuristics with other inference processes, such as Gibbs sampling.
  • Keywords
    expectation-maximisation algorithm; inference mechanisms; pattern clustering; probability; text analysis; unsupervised learning; vocabulary; expectation-maximization algorithm; inference process; multinomial distribution; probabilistic model; text collection; unsupervised document clustering; vocabulary reduction; Availability; Clustering algorithms; Electronic mail; Inference algorithms; Parameter estimation; Performance analysis; Sampling methods; Text mining; Vocabulary; Web pages
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Statistical Signal Processing, 2005 IEEE/SP 13th Workshop on
  • Conference_Location
    Novosibirsk
  • Print_ISBN
    0-7803-9403-8
  • Type

    conf

  • DOI
    10.1109/SSP.2005.1628626
  • Filename
    1628626
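
The mixture-of-multinomials model and EM inference named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `em_multinomial_mixture`, the additive `smoothing` constant, the random soft initialization, and the toy corpus are all assumptions made for illustration, and the vocabulary-reduction and Gibbs-sampling variants mentioned in the abstract are not covered.

```python
import numpy as np


def em_multinomial_mixture(counts, n_components, n_iter=50, smoothing=1e-2, seed=0):
    """Fit a mixture of multinomials to a (n_docs, n_words) count matrix with EM.

    Hypothetical helper for illustration only; `smoothing` is an assumed
    additive constant that keeps every probability strictly positive.
    """
    counts = np.asarray(counts, dtype=float)
    rng = np.random.default_rng(seed)
    n_docs, _ = counts.shape

    # Assumed initialization: random soft assignment of documents to themes.
    resp = rng.dirichlet(np.ones(n_components), size=n_docs)

    for _ in range(n_iter):
        # M-step: re-estimate theme weights and per-theme word probabilities
        # from the current soft assignments.
        weights = resp.sum(axis=0) + smoothing
        weights /= weights.sum()
        word_probs = resp.T @ counts + smoothing          # shape (K, V)
        word_probs /= word_probs.sum(axis=1, keepdims=True)

        # E-step: posterior probability of each theme given each document,
        # computed in the log domain for numerical stability. The multinomial
        # coefficient is the same for every theme, so it cancels out.
        log_post = np.log(weights) + counts @ np.log(word_probs).T  # (D, K)
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)

    return weights, word_probs, resp


# Toy corpus (assumed data): four documents over a six-word vocabulary.
toy_counts = np.array([
    [5, 4, 6, 0, 0, 0],
    [6, 5, 5, 0, 0, 0],
    [0, 0, 0, 4, 6, 5],
    [0, 0, 0, 5, 5, 6],
])
weights, word_probs, resp = em_multinomial_mixture(toy_counts, n_components=2)
print(resp.argmax(axis=1))  # hard cluster label per document
```

Since the abstract stresses the importance of initialization, the labels produced by a sketch like this may change with `seed`.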