Title :
Sparse Word Graphs: A Scalable Algorithm for Capturing Word Correlations in Topic Models
Author :
Nallapati, Ramesh ; Ahmed, Amr ; Cohen, William ; Xing, Eric
Abstract :
Statistical topic models such as Latent Dirichlet Allocation (LDA) have emerged as an attractive framework to model, visualize and summarize large document collections in a completely unsupervised fashion. One of the limitations of this family of models is their assumption of exchangeability of words within documents, which results in a 'bag-of-words' representation for documents as well as topics. As a consequence, precious information that exists in the form of correlations between words is lost in these models. In this work, we adapt recent advances in sparse modeling techniques to the problem of modeling word correlations within topics and present a new algorithm called Sparse Word Graphs. Our experiments on the AP corpus reveal both long-distance and short-distance word correlations within topics that are semantically very meaningful. In addition, the new algorithm is highly scalable to large collections as it captures only the most important correlations in a sparse manner.
Keywords :
Conferences; Data mining; Educational programs; Hidden Markov models; Linear discriminant analysis; Machine learning; Machine learning algorithms; Subspace constraints; USA Councils; Visualization
Conference_Title :
Data Mining Workshops, 2007. ICDM Workshops 2007. Seventh IEEE International Conference on
Conference_Location :
Omaha, NE
Print_ISBN :
978-0-7695-3019-2
Electronic_ISBN :
978-0-7695-3033-8
DOI :
10.1109/ICDMW.2007.39