• DocumentCode
    188236
  • Title

    Refining LDA Results and Ranking Topics in Order of Quantity and Quality with an Application to Twitter Streaming Data

  • Author

    Fujino, Iwao

  • Author_Institution
    Sch. of Inf. & Telecommun. Eng., Tokai Univ., Tokyo, Japan
  • fYear
    2014
  • fDate
    13-15 Oct. 2014
  • Firstpage
    209
  • Lastpage
    216
  • Abstract
    Topic modeling is an emerging approach for summarizing data, especially text data, in terms of a small set of latent variables. The most useful implementation of a topic model is the LDA method, an unsupervised machine learning technique for identifying latent topic information in a massive document collection. However, the LDA method sometimes yields results that are hard to understand or meaningless. To address this problem, this paper proposes a method for refining LDA results and for ranking topics according to a significance criterion. The study rests on two basic assumptions. The first is that the correlation coefficient between any two different topics should be zero under ideal conditions. The second is that the quality of a topic can be defined as its deviation from a background topic. Starting from these two assumptions, we provide a concrete method for determining the number of topics when using the LDA method to extract topics from document data, and for ranking the LDA results in order of quality. To validate the proposed methods, we conducted several experiments on Twitter streaming data. The results of these experiments show that the methods work efficiently, as expected.
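    The two assumptions in the abstract can be illustrated with a minimal sketch. It is not the paper's procedure: the abstract does not say how the background topic is built or how the correlations are turned into a choice of topic count. The sketch assumes that each topic is a row of a topic-word probability matrix (here called phi, a hypothetical name), that the background topic is the mean of all topic rows, and that deviation is measured with the Jensen-Shannon divergence listed in the keywords. The function names and the toy matrix are illustrative only.

```python
import numpy as np


def topic_correlations(phi):
    """Pearson correlation between every pair of topic rows in phi."""
    return np.corrcoef(phi)


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two probability vectors."""
    m = 0.5 * (p + q)

    def kl(a, b):
        # Terms with a == 0 contribute nothing; b > 0 wherever a > 0.
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def rank_topics(phi, background=None):
    """Rank topics by divergence from a background distribution.

    Assumption: when no background is given, the mean of the topic rows
    stands in for it. The paper may define the background differently.
    """
    if background is None:
        background = phi.mean(axis=0)
    scores = np.array([js_divergence(row, background) for row in phi])
    order = np.argsort(-scores)  # largest deviation first
    return order, scores


# Toy topic-word matrix: 3 topics over a 4-word vocabulary (made-up numbers).
phi = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.70, 0.10, 0.10],
    [0.30, 0.30, 0.20, 0.20],
])

corr = topic_correlations(phi)
order, scores = rank_topics(phi)
```

    In this toy case the off-diagonal entries of corr show how far the topics are from the ideal of zero correlation, and the third topic, which lies closest to the assumed background, is ranked last.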
  • Keywords
    document handling; learning (artificial intelligence); social networking (online); LDA method; Twitter streaming data; documents data topic extraction; massive document collection; ranking topics; topic model; unsupervised machine learning technique; Correlation; Correlation coefficient; Data models; Probability distribution; Refining; Twitter; Vectors; Jensen-Shannon divergence; LDA (Latent Dirichlet Allocation); Twitter; correlation coefficient; topic model;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC), 2014 International Conference on
  • Conference_Location
    Shanghai
  • Print_ISBN
    978-1-4799-6235-8
  • Type

    conf

  • DOI
    10.1109/CyberC.2014.45
  • Filename
    6984308