• DocumentCode
    1557513
  • Title
    On Confidence-Constrained Rank Recovery in Topic Models
  • Author
    Behmardi, Behrouz; Raich, Raviv
  • Author_Institution
    Sch. of EECS, Oregon State Univ., Corvallis, OR, USA
  • Volume
    60
  • Issue
    10
  • fYear
    2012
  • Firstpage
    5146
  • Lastpage
    5162
  • Abstract
    Topic models have been proposed to model collections of data, such as text documents and images, in which each object (e.g., a document) contains a set of instances (e.g., words). In many topic models, the dimension of the latent topic space (the number of topics) is assumed to be a deterministic unknown. The number of topics significantly affects the prediction performance and interpretability of the estimated topics. In this paper, we propose confidence-constrained rank minimization (CRM) to recover the exact number of topics in topic models, with theoretical guarantees on the recovery probability and the mean squared error of the estimation. We provide a computationally efficient optimization algorithm for the problem to extend the applicability of the proposed framework to large real-world datasets. Numerical evaluations are used to verify our theoretical results. Additionally, to illustrate the applicability of the proposed framework to practical problems, we provide results in image classification for two real-world datasets and text classification for three real-world datasets.
  • Keywords
    information retrieval; mean square error methods; minimisation; probability; text analysis; CRM; computationally-efficient optimization algorithm; confidence-constrained rank minimization; confidence-constrained rank recovery; data collection; estimated topic; image classification; latent topic space; mean squared error; numerical evaluation; prediction performance; recovery probability; text classification; text document; topic model; topic recovery; Biological system modeling; Computational modeling; Customer relationship management; Matrix decomposition; Minimization; Probabilistic logic; Vocabulary; Confidence constraints; low-rank matrix recovery; nuclear norm minimization; rank estimation; topic models;
  • fLanguage
    English
  • Journal_Title
    Signal Processing, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1053-587X
  • Type
    jour
  • DOI
    10.1109/TSP.2012.2208634
  • Filename
    6239610