Title :
Finding main topics in blogosphere using document clustering based on topic model
Author :
Xuan, Wen-Feng ; Liu, Bing-quan ; Sun, Cheng-jie ; Zhang, De-Yuan ; Wang, Xiao-long
Author_Institution :
Intell. Technol. & Natural Language Process. Lab., Harbin Inst. of Technol., Harbin, China
Abstract :
With the rapid growth of user-generated content in the blogosphere, it is becoming increasingly difficult for users to find the information they want, so an effective way of organizing this information is increasingly important. This paper presents a triple-layer method. The first layer handles document representation and is based on the latent Dirichlet allocation (LDA) topic model. The second layer handles document clustering and is based on the Markov cluster algorithm. The third layer generates topic words. With this method, information in the blogosphere, such as blog posts, can be organized by topic, where each topic is expressed as a list of topic words. An empirical study on the real-world blog site CSDN shows that the proposed method is effective. The method also gives users a convenient way to access and explore information in the blogosphere.
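The clustering layer described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes each document is already represented by a topic-proportion vector (such as an LDA model might produce), builds a similarity graph using cosine similarity with a cutoff, and runs a textbook form of Markov clustering (alternating expansion and inflation). The function names, the `threshold` and `inflation` values, and the choice of cosine similarity are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two topic-proportion vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def normalize_columns(m):
    """Scale every column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain square-matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def markov_cluster(topic_vectors, threshold=0.5, inflation=2.0,
                   iterations=50, tol=1e-9):
    """Cluster documents by Markov clustering on a similarity graph.

    threshold and inflation are illustrative values, not taken from the paper.
    """
    n = len(topic_vectors)
    # Similarity graph: keep edges at or above the threshold, add self-loops.
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                m[i][j] = 1.0
            else:
                s = cosine(topic_vectors[i], topic_vectors[j])
                m[i][j] = s if s >= threshold else 0.0
    m = normalize_columns(m)

    for _ in range(iterations):
        expanded = matmul(m, m)  # expansion: two-step random walk
        # Inflation: raise entries to a power, then renormalize columns.
        inflated = normalize_columns(
            [[x ** inflation for x in row] for row in expanded])
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break

    # Each row that retains mass lists the members of one cluster.
    clusters = []
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members and members not in clusters:
            clusters.append(members)
    return [sorted(c) for c in clusters]


# Hypothetical topic proportions for four documents over two topics.
vectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
clusters = markov_cluster(vectors)
```

In this toy input the first two documents lean toward one topic and the last two toward the other, so they end up in separate clusters. Generating topic words for each cluster, the third layer in the abstract, is not covered by this sketch.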
Keywords :
Internet; Markov processes; Web sites; document handling; pattern clustering; Markov cluster algorithm; blogosphere information; document clustering; document representation; latent Dirichlet allocation; topic model; triple layer method; Algorithm design and analysis; Blogs; Clustering algorithms; Cybernetics; Indexes; Machine learning; Clustering description; Dimension reduction; Document representation; Topic model; Weblogs
Conference_Title :
Machine Learning and Cybernetics (ICMLC), 2011 International Conference on
Conference_Location :
Guilin
Print_ISBN :
978-1-4577-0305-8
DOI :
10.1109/ICMLC.2011.6016947