DocumentCode :
579971
Title :
Learning Topic Models -- Going beyond SVD
Author :
Arora, Sanjeev ; Ge, Rong ; Moitra, Ankur
Author_Institution :
CCI, Princeton Univ., Princeton, NJ, USA
fYear :
2012
fDate :
20-23 Oct. 2012
Firstpage :
1
Lastpage :
10
Abstract :
Topic Modeling is an approach used for automatic comprehension and classification of data in a variety of settings, and perhaps the canonical application is in uncovering thematic structure in a corpus of documents. A number of foundational works both in machine learning and in theory have suggested a probabilistic model for documents, whereby documents arise as a convex combination of (i.e. distribution on) a small number of topic vectors, each topic vector being a distribution on words (i.e. a vector of word-frequencies). Similar models have since been used in a variety of application areas; the Latent Dirichlet Allocation (LDA) model of Blei et al. is especially popular. Theoretical studies of topic modeling focus on learning the model's parameters assuming the data is actually generated from it. Existing approaches for the most part rely on Singular Value Decomposition (SVD), and consequently have one of two limitations: they must either assume that each document contains only one topic, or else can only recover the span of the topic vectors instead of the topic vectors themselves. This paper formally justifies Nonnegative Matrix Factorization (NMF), an analog of SVD in which all vectors are nonnegative, as a main tool in this context. Using this tool we give the first polynomial-time algorithm for learning topic models without the above two limitations. The algorithm uses a fairly mild assumption about the underlying topic matrix called separability, which is usually found to hold in real-life data. Perhaps the most attractive feature of our algorithm is that it generalizes to yet more realistic models that incorporate topic-topic correlations, such as the Correlated Topic Model (CTM) and the Pachinko Allocation Model (PAM). We hope that this paper will motivate further theoretical results that use NMF as a replacement for SVD, just as NMF has come to replace SVD in many applications.
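The generative model and the separability assumption described in the abstract can be sketched in a few lines of NumPy. This is an illustrative toy construction, not the paper's algorithm: the topic matrix `A`, the topic-proportion matrix `W`, and the anchor-word rows are all made up here to show why separability (each topic owning an "anchor" word that appears in that topic only) makes the nonnegative factorization identifiable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "separable" topic matrix A (words x topics): each column is a
# distribution on words, and each topic has an anchor word, i.e. a word
# with nonzero probability in that topic only. (Illustrative values.)
A = np.array([
    [0.4, 0.0],   # anchor word for topic 0
    [0.0, 0.5],   # anchor word for topic 1
    [0.3, 0.2],
    [0.3, 0.3],
])

# Documents arise as convex combinations of the topic columns:
# each column of W is a distribution on topics, drawn from a Dirichlet.
W = rng.dirichlet([1.0, 1.0], size=5).T   # topics x documents
M = A @ W                                 # words x documents

# Every column of M is itself a valid word distribution.
assert np.allclose(M.sum(axis=0), 1.0)

# Separability: the row of M for topic k's anchor word is a scalar
# multiple of row k of W, so the anchor rows reveal W (up to scaling)
# and the factorization M = A W becomes identifiable.
print(np.allclose(M[0], A[0, 0] * W[0]))  # True
```

The anchor-word property is what the recovery argument exploits: once the anchor rows pin down `W`, the remaining entries of `A` can be solved for, which is the intuition behind the polynomial-time guarantee the abstract claims.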
Keywords :
computational complexity; document handling; learning (artificial intelligence); pattern classification; singular value decomposition; CTM; LDA model; NMF; PAM; Pachinko allocation model; SVD; automatic data classification; automatic data comprehension; correlated topic model; document corpus; latent Dirichlet allocation; machine learning; nonnegative matrix factorization; polynomial-time algorithm; probabilistic model; separability; singular value decomposition; thematic structure; topic matrix; topic model learning; topic vectors; word-frequencies; Approximation methods; Computational modeling; Covariance matrix; Data models; Dictionaries; Noise measurement; Vectors;
fLanguage :
English
Publisher :
ieee
Conference_Titel :
Foundations of Computer Science (FOCS), 2012 IEEE 53rd Annual Symposium on
Conference_Location :
New Brunswick, NJ
ISSN :
0272-5428
Print_ISBN :
978-1-4673-4383-1
Type :
conf
DOI :
10.1109/FOCS.2012.49
Filename :
6375276