Title :
Effects of Word Assignment in LDA for News Topic Discovery
Author :
Chuen-Min Huang ; Cheng-Yi Wu
Author_Institution :
Dept. of Inf. Manage., Nat. Yunlin Univ. of Sci. & Technol., Yunlin, Taiwan
Abstract :
In traditional LDA, latent variables are inferred under the "bag-of-words" assumption, in which word order is ignored. This assumption is valued for its computational efficiency, but it is regarded as impractical in many language model applications where word order is essential. In this study, we propose concatenating words into compounds based on morphological rules and build the connection between compounds and topics. We used Yahoo! Taiwan news from May 23, 2013 to June 20, 2013 in three categories (politics, economics, and life), and also extracted one third of the news pool at random from each category to form a mixed dataset. We compared unigrams and compounds in terms of topic coherence and performance. The results show that the proposed model has a higher perplexity value, yet it yields more accurate meaning and better computational efficiency than traditional LDA.
Keywords :
information resources; natural language processing; text analysis; LDA; Yahoo! Taiwan news; bag-of-words; compounds; computational efficiency; economics; language model applications; latent Dirichlet allocation; latent variables; mixed dataset; morphological rules; news pool; news topic discovery; politics; topic coherence; unigrams; word assignment; word concatenation; word order; Analytical models; Coherence; Computational modeling; Context; Data models; compounds-based; topic discovery; unigram
Conference_Title :
Big Data (BigData Congress), 2015 IEEE International Congress on
Conference_Location :
New York, NY
Print_ISBN :
978-1-4673-7277-0
DOI :
10.1109/BigDataCongress.2015.62