DocumentCode :
579971
Title :
Learning Topic Models -- Going beyond SVD
Author :
Arora, Sanjeev ; Ge, Rong ; Moitra, Ankur
Author_Institution :
CCI, Princeton Univ., Princeton, NJ, USA
fYear :
2012
fDate :
20-23 Oct. 2012
Firstpage :
1
Lastpage :
10
Abstract :
Topic Modeling is an approach used for automatic comprehension and classification of data in a variety of settings, and perhaps the canonical application is in uncovering thematic structure in a corpus of documents. A number of foundational works both in machine learning and in theory have suggested a probabilistic model for documents, whereby documents arise as a convex combination of (i.e. distribution on) a small number of topic vectors, each topic vector being a distribution on words (i.e. a vector of word-frequencies). Similar models have since been used in a variety of application areas; the Latent Dirichlet Allocation (LDA) model of Blei et al. is especially popular. Theoretical studies of topic modeling focus on learning the model's parameters assuming the data is actually generated from it. Existing approaches for the most part rely on Singular Value Decomposition (SVD), and consequently have one of two limitations: they must either assume that each document contains only one topic, or else can only recover the span of the topic vectors instead of the topic vectors themselves. This paper formally justifies Nonnegative Matrix Factorization (NMF), an analog of SVD in which all vectors are nonnegative, as a main tool in this context. Using this tool we give the first polynomial-time algorithm for learning topic models without the above two limitations. The algorithm uses a fairly mild assumption about the underlying topic matrix called separability, which is usually found to hold in real-life data. Perhaps the most attractive feature of our algorithm is that it generalizes to yet more realistic models that incorporate topic-topic correlations, such as the Correlated Topic Model (CTM) and the Pachinko Allocation Model (PAM). We hope that this paper will motivate further theoretical results that use NMF as a replacement for SVD, just as NMF has come to replace SVD in many applications.
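The generative model and the separability assumption described in the abstract can be sketched in a few lines of NumPy. This is an illustrative toy construction, not the paper's algorithm: the topic matrix `A`, the topic-proportion matrix `W`, and the anchor-word rows are all made up here to show why separability (each topic owning an "anchor" word that appears in that topic only) makes the nonnegative factorization identifiable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "separable" topic matrix A (words x topics): each column is a
# distribution on words, and each topic has an anchor word, i.e. a word
# with nonzero probability in that topic only. (Illustrative values.)
A = np.array([
    [0.4, 0.0],   # anchor word for topic 0
    [0.0, 0.5],   # anchor word for topic 1
    [0.3, 0.2],
    [0.3, 0.3],
])

# Documents arise as convex combinations of the topic columns:
# each column of W is a distribution on topics, drawn from a Dirichlet.
W = rng.dirichlet([1.0, 1.0], size=5).T   # topics x documents
M = A @ W                                 # words x documents

# Every column of M is itself a valid word distribution.
assert np.allclose(M.sum(axis=0), 1.0)

# Separability: the row of M for topic k's anchor word is a scalar
# multiple of row k of W, so the anchor rows reveal W (up to scaling)
# and the factorization M = A W becomes identifiable.
print(np.allclose(M[0], A[0, 0] * W[0]))  # True
```

The anchor-word property is what the recovery argument exploits: once the anchor rows pin down `W`, the remaining entries of `A` can be solved for, which is the intuition behind the polynomial-time guarantee the abstract claims.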
Keywords :
computational complexity; document handling; learning (artificial intelligence); pattern classification; singular value decomposition; CTM; LDA model; NMF; PAM; Pachinko allocation model; SVD; automatic data classification; automatic data comprehension; correlated topic model; document corpus; latent Dirichlet allocation; machine learning; nonnegative matrix factorization; polynomial-time algorithm; probabilistic model; separability; singular value decomposition; thematic structure; topic matrix; topic model learning; topic vectors; word-frequencies; Approximation methods; Computational modeling; Covariance matrix; Data models; Dictionaries; Noise measurement; Vectors;
fLanguage :
English
Publisher :
ieee
Conference_Titel :
Foundations of Computer Science (FOCS), 2012 IEEE 53rd Annual Symposium on
Conference_Location :
New Brunswick, NJ
ISSN :
0272-5428
Print_ISBN :
978-1-4673-4383-1
Type :
conf
DOI :
10.1109/FOCS.2012.49
Filename :
6375276