Title :
Adaptive Bayesian Latent Semantic Analysis
Author :
Chien, Jen-Tzung ; Wu, Meng-Sung
Author_Institution :
Dept. of Comput. Sci. & Inf. Eng., Nat. Cheng Kung Univ., Tainan
Abstract :
Due to the vast growth of data collections, statistical document modeling has become increasingly important in language processing. Probabilistic latent semantic analysis (PLSA) is a popular approach that effectively captures both semantics and statistics for modeling. However, PLSA is highly sensitive to the task domain, which changes continuously in real-world documents. In this paper, a novel Bayesian PLSA framework is presented. We focus on an incremental learning algorithm that updates the model as articles from new domains arrive. This algorithm improves document modeling by incrementally extracting up-to-date latent semantic information to match the changing domains at run time. Because the priors of the PLSA parameters are represented by Dirichlet densities, the posterior densities belong to the same distribution family, so a reproducible prior/posterior mechanism is available for incremental learning from constantly accumulating documents. An incremental PLSA algorithm is constructed to perform both parameter estimation and hyperparameter updating. Compared to standard PLSA based on maximum likelihood estimation, the proposed approach is capable of dynamic document indexing and modeling. We also present maximum a posteriori PLSA for corrective training. Experiments on information retrieval and document categorization demonstrate the superiority of the Bayesian PLSA methods.
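The reproducible prior/posterior mechanism mentioned in the abstract rests on Dirichlet-multinomial conjugacy. The following is a minimal sketch of that idea only, not the paper's algorithm: it treats a single multinomial over a three-word vocabulary with directly observed counts, whereas a latent-variable model such as PLSA would presumably use expected counts from an EM-style step. The function names, the toy batches, and the initial hyperparameters are all assumptions made for illustration.

```python
def update_hyperparameters(alpha, counts):
    """Conjugate update: posterior Dirichlet hyperparameters are
    the prior hyperparameters plus the counts from the new batch."""
    return [a + c for a, c in zip(alpha, counts)]


def map_estimate(alpha):
    """MAP estimate of multinomial probabilities under Dirichlet(alpha).
    Uses the mode (alpha_i - 1) / (sum(alpha) - K), valid when all alpha_i > 1."""
    k = len(alpha)
    total = sum(alpha) - k
    return [(a - 1.0) / total for a in alpha]


# Hypothetical prior over a 3-word vocabulary (illustrative values).
alpha = [2.0, 2.0, 2.0]

# Two hypothetical batches of word counts arriving over time.
for batch in ([3, 1, 0], [0, 2, 4]):
    # The posterior from this batch becomes the prior for the next one.
    alpha = update_hyperparameters(alpha, batch)
    probs = map_estimate(alpha)
```

Because only the hyperparameters are carried forward, earlier batches do not need to be stored or reprocessed, which is the property that makes this kind of update suitable for incremental learning.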
Keywords :
Bayes methods; computational linguistics; learning (artificial intelligence); natural language processing; probability; text analysis; Dirichlet densities; adaptive Bayesian PLSA; incremental learning algorithm; parameter estimation; probabilistic latent semantic analysis; statistical document modeling; Bayesian methods; Data mining; Frequency; Indexing; Information retrieval; Matrix decomposition; Maximum likelihood estimation; Natural languages; Statistical analysis; Bayesian theory; Dirichlet distribution; conjugate prior; incremental learning
Journal_Title :
Audio, Speech, and Language Processing, IEEE Transactions on
DOI :
10.1109/TASL.2007.909452