  • DocumentCode
    1758061
  • Title
    Probabilistic Word Selection via Topic Modeling
  • Author
    Yueting Zhuang ; Haidong Gao ; Fei Wu ; Siliang Tang ; Yin Zhang ; Zhongfei Zhang
  • Author_Institution
    Coll. of Comput. Sci. & Technol., Zhejiang Univ., Hangzhou, China
  • Volume
    27
  • Issue
    6
  • fYear
    2015
  • fDate
    June 1 2015
  • Firstpage
    1643
  • Lastpage
    1655
  • Abstract
    We propose selective supervised Latent Dirichlet Allocation (ssLDA) to boost the prediction performance of the widely studied supervised probabilistic topic models. We introduce a Bernoulli distribution for each word in a given document to select this word as a strongly or weakly discriminative one with respect to its assigned topic. The Bernoulli distribution is parameterized by the discrimination power of the word for its assigned topic. As a result, the document is represented as a “bag-of-selective-words” instead of the probabilistic “bag-of-topics” in the topic modeling domain or the flat “bag-of-words” in the traditional natural language processing domain, which offers a new perspective. Inheriting the general framework of supervised LDA (sLDA), ssLDA can also predict many types of response specified by a Gaussian Linear Model (GLM). In this paper we focus on using this word selection mechanism for single-label document classification: we use variational inference to approximate the intractable posterior and derive a maximum-likelihood estimation of the parameters in ssLDA. The experiments reported on textual documents show that ssLDA not only performs competitively with “state-of-the-art” classification approaches based on both the flat “bag-of-words” and the probabilistic “bag-of-topics” representations in terms of classification performance, but can also discover the discrimination power of the words specified in the topics (compatible with our rational knowledge).
  • Keywords
    Gaussian processes; natural language processing; pattern classification; text analysis; Bernoulli distribution; GLM; Gaussian linear model; bag-of-selective-words; discrimination power; maximum-likelihood parameters estimation; natural language processing domain; probabilistic word selection; selective supervised latent Dirichlet allocation; single-label document classification; ssLDA; state-of-the-art classification approaches; supervised probabilistic topic models; textual documents; Analytical models; Convergence; Data models; Equations; Predictive models; Probabilistic logic; Resource management; Classification; Latent Dirichlet Allocation; Supervised learning; Topic modeling
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type
    jour
  • DOI
    10.1109/TKDE.2014.2377727
  • Filename
    6985726