• DocumentCode
    4271
  • Title
    Dirichlet Process Mixture Model for Document Clustering with Feature Partition

  • Author
    Ruizhang Huang ; Guan Yu ; Zhaojun Wang ; Jun Zhang ; Liangxing Shi
  • Author_Institution
    Coll. of Comput. Sci. & Inf., Guizhou Univ., Guiyang, China
  • Volume
    25
  • Issue
    8
  • fYear
    2013
  • fDate
    Aug. 2013
  • Firstpage
    1748
  • Lastpage
    1759
  • Abstract
    Finding the appropriate number of clusters into which documents should be partitioned is crucial in document clustering. In this paper, we propose a novel approach, namely DPMFP, that discovers the latent cluster structure based on the Dirichlet process mixture (DPM) model without requiring the number of clusters as input. Document features are automatically partitioned into two groups, discriminative words and nondiscriminative words, which contribute differently to document clustering. A variational inference algorithm is investigated to infer the document collection structure and the partition of document words simultaneously. Our experiments indicate that the proposed approach performs well on a synthetic data set as well as on real data sets. A comparison between our approach and state-of-the-art document clustering approaches shows that our approach is robust and effective for document clustering.
  • Keywords
    document handling; pattern clustering; DPM model; DPMFP; Dirichlet process mixture model; document clustering; document collection structure; document words partition; feature partition; nondiscriminative words; synthetic data set; variational inference algorithm; Approximation algorithms; Approximation methods; Clustering algorithms; Data models; Equations; Inference algorithms; Mathematical model; Database management; clustering; database applications-text mining; pattern recognition
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type
    jour
  • DOI
    10.1109/TKDE.2012.27
  • Filename
    6152106