Title :
Unsupervised Learning from Linked Documents
Author :
Guo, Zhen ; Zhu, Shenghuo ; Chi, Yun ; Zhang, Zhongfei ; Gong, Yihong
Author_Institution :
Comput. Sci. Dept., SUNY at Binghamton, Binghamton, NY, USA
Abstract :
Documents in many corpora, such as digital libraries and webpages, contain both content and link information. Traditional topic models, which play an important role in unsupervised learning, either ignore the link information entirely or treat it as a feature similar to content. We believe that neither approach accurately captures the relations represented by links. To address this limitation, in this paper we propose a citation-topic (CT) model that explicitly considers the document relations represented by links. In the CT model, links are not treated as yet another feature; instead, they are used to form the structure of the generative model. As a result, a given document is modeled as a mixture of a set of topic distributions, each of which is borrowed (cited) from a document that is related to the given document. We apply the CT model to several document collections, and experimental comparisons against state-of-the-art approaches demonstrate promising performance.
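The mixture structure described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the model's equations, so the factorization used here, p(w | d) = sum over c of p(c | d) times sum over z of p(z | c) times p(w | z), is an assumption, and the names `cite_weights`, `doc_topics`, `topic_words`, `mix`, and `word_distribution`, along with all numeric values, are hypothetical.

```python
# Toy sketch (assumed factorization, hypothetical numbers): a document's word
# distribution is a mixture of topic distributions borrowed from related documents.

# cite_weights[d][c]: assumed probability that document d borrows from document c
cite_weights = [
    [0.6, 0.4, 0.0],
    [0.0, 1.0, 0.0],
    [0.3, 0.2, 0.5],
]

# doc_topics[c][z]: assumed topic distribution owned by document c
doc_topics = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.5, 0.5],
]

# topic_words[z][w]: assumed word distribution of topic z over a 4-word vocabulary
topic_words = [
    [0.4, 0.4, 0.1, 0.1],
    [0.1, 0.1, 0.4, 0.4],
]


def mix(weights, rows):
    """Return the weighted sum of the given rows: sum_i weights[i] * rows[i]."""
    width = len(rows[0])
    return [sum(w * row[j] for w, row in zip(weights, rows)) for j in range(width)]


def word_distribution(d, cite_weights, doc_topics, topic_words):
    """Compute p(w | d) under the assumed citation-then-topic mixture."""
    # Step 1: blend the topic distributions of the documents that d is linked to.
    mixed_topics = mix(cite_weights[d], doc_topics)
    # Step 2: blend the topics' word distributions using those mixed topic weights.
    return mix(mixed_topics, topic_words)


if __name__ == "__main__":
    for d in range(len(cite_weights)):
        print(d, word_distribution(d, cite_weights, doc_topics, topic_words))
```

In this sketch the links determine which topic distributions a document may draw on, rather than appearing as extra features alongside the words, which is the distinction the abstract emphasizes.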
Keywords :
Internet; digital libraries; document handling; unsupervised learning; Web pages; citation-topic model; linked documents; Accuracy; IP networks; Indexing; Machine learning; Measurement; Probabilistic logic; document clustering; latent topic model
Conference_Title :
Pattern Recognition (ICPR), 2010 20th International Conference on
Conference_Location :
Istanbul
Print_ISBN :
978-1-4244-7542-1
DOI :
10.1109/ICPR.2010.184