Title :
On Confidence-Constrained Rank Recovery in Topic Models
Author :
Behmardi, Behrouz ; Raich, Raviv
Author_Institution :
Sch. of EECS, Oregon State Univ., Corvallis, OR, USA
Abstract :
Topic models have been proposed to model a collection of data such as text documents and images in which each object (e.g., a document) contains a set of instances (e.g., words). In many topic models, the dimension of the latent topic space (the number of topics) is assumed to be a deterministic unknown. The number of topics significantly affects the prediction performance and interpretability of the estimated topics. In this paper, we propose confidence-constrained rank minimization (CRM) to recover the exact number of topics in topic models, with theoretical guarantees on the recovery probability and the mean squared error of the estimation. We provide a computationally efficient optimization algorithm for the problem to extend the applicability of the proposed framework to large real-world datasets. Numerical evaluations are used to verify our theoretical results. Additionally, to illustrate the applicability of the proposed framework to practical problems, we provide results in image classification for two real-world datasets and text classification for three real-world datasets.
Keywords :
information retrieval; mean square error methods; minimisation; probability; text analysis; CRM; computationally-efficient optimization algorithm; confidence-constrained rank minimization; confidence-constrained rank recovery; data collection; estimated topic; image classification; latent topic space; mean squared error; numerical evaluation; prediction performance; recovery probability; text classification; text document; topic model; topic recovery; Biological system modeling; Computational modeling; Customer relationship management; Matrix decomposition; Minimization; Probabilistic logic; Vocabulary; Confidence constraints; low-rank matrix recovery; nuclear norm minimization; rank estimation; topic models;
Journal_Title :
Signal Processing, IEEE Transactions on
DOI :
10.1109/TSP.2012.2208634