DocumentCode
1758061
Title
Probabilistic Word Selection via Topic Modeling
Author
Yueting Zhuang ; Haidong Gao ; Fei Wu ; Siliang Tang ; Yin Zhang ; Zhongfei Zhang
Author_Institution
Coll. of Comput. Sci. & Technol., Zhejiang Univ., Hangzhou, China
Volume
27
Issue
6
fYear
2015
fDate
June 1 2015
Firstpage
1643
Lastpage
1655
Abstract
We propose selective supervised Latent Dirichlet Allocation (ssLDA) to boost the prediction performance of the widely studied supervised probabilistic topic models. We introduce a Bernoulli distribution for each word in a given document to select this word as a strongly or weakly discriminative one with respect to its assigned topic. The Bernoulli distribution is parameterized by the discrimination power of the word for its assigned topic. As a result, the document is represented as a “bag-of-selective-words” instead of the probabilistic “bag-of-topics” of the topic modeling domain or the flat “bag-of-words” of the traditional natural language processing domain, which offers a new perspective. Inheriting the general framework of supervised LDA (sLDA), ssLDA can also predict many types of response specified by a Gaussian Linear Model (GLM). Focusing in this paper on the use of this word selection mechanism for single-label document classification, we use variational inference to approximate the intractable posterior and derive a maximum-likelihood estimation of the parameters in ssLDA. Experiments on textual documents show that ssLDA not only performs competitively against state-of-the-art classification approaches based on both the flat “bag-of-words” and the probabilistic “bag-of-topics” representations in terms of classification performance, but can also discover the discrimination power of the words specified in the topics (compatible with our rational knowledge).
Keywords
Gaussian processes; natural language processing; pattern classification; text analysis; Bernoulli distribution; GLM; Gaussian linear model; bag-of-selective-words; discrimination power; maximum-likelihood parameters estimation; natural language processing domain; probabilistic word selection; selective supervised latent Dirichlet allocation; single-label document classification; ssLDA; state-of-the-art classification approaches; supervised probabilistic topic models; textual documents; Analytical models; Convergence; Data models; Equations; Predictive models; Probabilistic logic; Resource management; Classification; Latent Dirichlet Allocation; Supervised learning; Topic modeling
fLanguage
English
Journal_Title
Knowledge and Data Engineering, IEEE Transactions on
Publisher
IEEE
ISSN
1041-4347
Type
jour
DOI
10.1109/TKDE.2014.2377727
Filename
6985726
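Illustration
The word selection step described in the abstract can be sketched as follows. This is a minimal toy sketch under stated assumptions, not the authors' ssLDA implementation: the function name `bag_of_selective_words`, the `power` lookup table, and the choice to keep only words flagged as strongly discriminative are all hypothetical, and topic assignments are taken as given rather than inferred.

```python
import random
from collections import Counter


def bag_of_selective_words(words, topics, power, seed=0):
    """Flag each word as strongly or weakly discriminative with a Bernoulli draw.

    power maps (topic, word) to a value in [0, 1]. It is a hypothetical
    stand-in for the discrimination power of a word for its assigned topic.
    """
    rng = random.Random(seed)
    bag = Counter()
    for word, topic in zip(words, topics):
        p = power.get((topic, word), 0.0)  # assumption: unknown pairs get 0
        strong = rng.random() < p          # Bernoulli(p) draw
        if strong:
            # Assumption: only strongly discriminative words enter the bag.
            bag[word] += 1
    return bag


# Toy document with made-up topic assignments and discrimination powers.
doc_words = ["goal", "the", "match", "the"]
doc_topics = [0, 0, 0, 1]
toy_power = {
    (0, "goal"): 1.0,
    (0, "match"): 1.0,
    (0, "the"): 0.0,
    (1, "the"): 0.0,
}
bag = bag_of_selective_words(doc_words, doc_topics, toy_power)
```

With powers of exactly 0 or 1 the draws are deterministic, so the toy bag keeps "goal" and "match" and drops both occurrences of "the". In the model described by the abstract, the Bernoulli parameters are estimated from data rather than supplied by hand.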