Title :
Dirichlet Mixture Allocation for Multiclass Document Collections Modeling
Author :
Bian, Wei ; Tao, Dacheng
Author_Institution :
Sch. of Comput. Eng., Nanyang Technol. Univ., Singapore, Singapore
Abstract :
The topic model latent Dirichlet allocation (LDA) is an effective tool for statistical analysis of large collections of documents. In LDA, each document is modeled as a mixture of topics, and the topic proportions are generated from a unimodal Dirichlet prior. When a collection of documents is drawn from multiple classes, this unimodal prior is insufficient for data fitting. To solve this problem, we exploit a multimodal Dirichlet mixture prior and propose Dirichlet mixture allocation (DMA). We report experiments on the popular TDT2 Corpus demonstrating that DMA models a collection of documents more precisely than LDA when the documents are obtained from multiple classes.
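The contrast between the two priors can be illustrated with a minimal sampling sketch. It is not the authors' model or inference procedure, which the abstract does not specify; it only assumes one plausible reading, in which a Dirichlet mixture prior first selects a component and then draws topic proportions from that component's Dirichlet. The function names, the three-topic setup, the mixing weights, and the concentration parameters are all hypothetical values chosen for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one point on the simplex from Dirichlet(alpha) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_mixture_proportions(weights, alphas, rng):
    """Pick a mixture component, then draw topic proportions from its Dirichlet.

    Assumed reading of a "Dirichlet mixture prior"; not taken from the paper.
    """
    component = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return component, sample_dirichlet(alphas[component], rng)


rng = random.Random(0)

# Single Dirichlet prior, as in LDA: every document shares one alpha (hypothetical values).
lda_theta = [sample_dirichlet([1.0, 1.0, 1.0], rng) for _ in range(5)]

# Two-component mixture prior (hypothetical): each component favors a different topic,
# so documents from different classes can concentrate on different regions of the simplex.
mix = [
    sample_mixture_proportions([0.5, 0.5], [[20.0, 1.0, 1.0], [1.0, 1.0, 20.0]], rng)
    for _ in range(5)
]
```

Each entry of `mix` is a `(component, proportions)` pair; draws from the first component tend to put most mass on the first topic and draws from the second on the third, which is the multimodality a single Dirichlet cannot express.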
Keywords :
statistical analysis; text analysis; Dirichlet mixture allocation; TDT2 Corpus; data fitting; latent Dirichlet allocation; multiclass document collections modeling; multimodal Dirichlet mixture prior; text modeling; unimodal Dirichlet distribution prior; Bayesian methods; Data engineering; Data mining; Image retrieval; Indexing; Inference algorithms; Information retrieval; Linear discriminant analysis; Vocabulary; Dirichlet mixture; multiclass; topic model
Conference_Title :
Data Mining, 2009. ICDM '09. Ninth IEEE International Conference on
Conference_Location :
Miami, FL
Print_ISBN :
978-1-4244-5242-2
ISSN :
1550-4786
DOI :
10.1109/ICDM.2009.102