Title :
Textual Document Clustering Using Topic Models
Author_Institution :
Knowledge Grid Group, Inst. of Comput. Technol., Beijing, China
Abstract :
Document clustering groups documents according to semantic features defined on the document set, which are used to measure the similarity between two documents. Keyword models such as the TFIDF model have been widely used as features for document clustering. However, they lack semantic structure, which limits their further use in document analysis. Topic models have been developed to discover multiple probabilistic distributions over the vocabulary, which can be seen as different topic dimensions of the document set. They have a richer semantic structure than TFIDF models. By using a topic model to cluster documents, one can obtain not only the document ids of the clusters but also the topics of the clusters and of the global document set. There are two major ways to use topic models in document clustering: one is based on the basic topic model and the other is based on new cluster-oriented topic models. In this paper, we evaluate the basic clustering performance of these two types of methods. We propose several simple clustering methods based on the basic topic model and compare them with the cluster-oriented topic model and other major clustering methods. The experimental results show that the simple methods can achieve clustering accuracy and recall comparable to those of the latest models and algorithms.
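The abstract does not spell out the simple clustering methods, so the following is only an illustrative sketch of one plausible approach: assign each document to the topic with the largest weight in its document-topic distribution, then describe each cluster by the highest-probability words of that topic. The function names, the toy distributions, and the vocabulary are all assumptions made for illustration; in practice the two matrices would come from a fitted topic model such as LDA.

```python
from collections import defaultdict


def cluster_by_dominant_topic(doc_topic):
    """Group document ids by the topic with the largest weight.

    doc_topic: list of per-document topic distributions (one row per document).
    Returns a dict mapping topic index -> list of document ids.
    This is an assumed, simple rule, not necessarily the authors' method.
    """
    clusters = defaultdict(list)
    for doc_id, weights in enumerate(doc_topic):
        dominant = max(range(len(weights)), key=weights.__getitem__)
        clusters[dominant].append(doc_id)
    return dict(clusters)


def top_words(topic_word, vocab, k=3):
    """Return the k highest-probability words of each topic."""
    result = []
    for row in topic_word:
        ranked = sorted(range(len(row)), key=lambda i: -row[i])
        result.append([vocab[i] for i in ranked[:k]])
    return result


# Toy values (hypothetical): 3 documents, 2 topics, 4 vocabulary words.
doc_topic = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
topic_word = [[0.5, 0.4, 0.05, 0.05], [0.05, 0.05, 0.5, 0.4]]
vocab = ["grid", "semantic", "cluster", "topic"]

clusters = cluster_by_dominant_topic(doc_topic)  # document ids per cluster
labels = top_words(topic_word, vocab, k=2)       # describing words per cluster
```

Here `clusters` gives the document ids of each cluster and `labels` gives the topic words that describe it, which mirrors the two kinds of output the abstract attributes to topic-model-based clustering.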
Keywords :
pattern clustering; probability; text analysis; TFIDF model; document analysis; keyword models; multiple probabilistic distributions; semantic features; textual document clustering; topic models; Clustering algorithms; Clustering methods; Computational modeling; Measurement; Probabilistic logic; Semantics; Vocabulary; Latent Dirichlet Allocation (LDA); document clustering; probabilistic topic model;
Conference_Title :
Semantics, Knowledge and Grids (SKG), 2014 10th International Conference on
Conference_Location :
Beijing
DOI :
10.1109/SKG.2014.27