• DocumentCode
    1666327
  • Title

    Effects of Word Assignment in LDA for News Topic Discovery

  • Author

    Chuen-Min Huang ; Cheng-Yi Wu

  • Author_Institution
    Dept. of Inf. Manage., Nat. Yunlin Univ. of Sci. & Technol., Yunlin, Taiwan
  • fYear
    2015
  • Firstpage
    374
  • Lastpage
    380
  • Abstract
    In traditional LDA, latent variables are inferred under the "bag-of-words" assumption, in which word order is ignored. This assumption is valued for its computational efficiency, but it is regarded as impractical in many language model applications where word order is essential. In this study, we proposed concatenating words into compounds based on morphological rules and built the connection between compounds and topics. We used Yahoo! Taiwan news from May 23, 2013 to June 20, 2013 in three categories (politics, economics, and life), and also drew one third of the news pool at random from each category to form a mixed dataset. We compared unigrams and compounds in terms of topic coherence and performance. The results show that the proposed model has a higher perplexity, while it conveys more accurate meaning and achieves better computational efficiency than traditional LDA.
  • Keywords
    information resources; natural language processing; text analysis; LDA; Yahoo! Taiwan news; bag-of-words; compounds; computational efficiency; economics; language model applications; latent Dirichlet allocation; latent variables; mixed dataset; morphological rules; news pool; news topic discovery; politics; topic coherence; unigrams; word assignment; word concatenation; word order; Analytical models; Coherence; Compounds; Computational efficiency; Computational modeling; Context; Data models; LDA; compounds-based; topic discovery; unigram
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Big Data (BigData Congress), 2015 IEEE International Congress on
  • Conference_Location
    New York, NY
  • Print_ISBN
    978-1-4673-7277-0
  • Type

    conf

  • DOI
    10.1109/BigDataCongress.2015.62
  • Filename
    7207246