• DocumentCode
    2708790
  • Title

    On-line LDA: Adaptive Topic Models for Mining Text Streams with Applications to Topic Detection and Tracking

  • Author

    AlSumait, Loulwah ; Barbara, Daniel ; Domeniconi, Carlotta

  • Author_Institution
    Dept. of Comput. Sci., George Mason Univ., Fairfax, VA
  • fYear
    2008
  • fDate
    15-19 Dec. 2008
  • Firstpage
    3
  • Lastpage
    12
  • Abstract
    This paper presents the online topic model (OLDA), a topic model that automatically captures the thematic patterns and identifies emerging topics of text streams and their changes over time. Our approach allows the topic modeling framework, specifically the latent Dirichlet allocation (LDA) model, to work in an online fashion such that it incrementally builds an up-to-date model (mixture of topics per document and mixture of words per topic) when a new document (or a set of documents) appears. A solution based on the empirical Bayes method is proposed. The idea is to incrementally update the current model according to the information inferred from the new stream of data with no need to access previous data. The dynamics of the proposed approach also provide an efficient means to track the topics over time and detect the emerging topics in real time. Our method is evaluated both qualitatively and quantitatively using benchmark datasets. In our experiments, OLDA has discovered interesting patterns by analyzing just a fraction of the data at a time. Our tests also demonstrate the ability of OLDA to align the topics across the epochs, through which the evolution of the topics over time is captured. OLDA is also comparable to, and sometimes better than, the original LDA in predicting the likelihood of unseen documents.
  • Keywords
    Bayes methods; data mining; text analysis; adaptive topic model; empirical Bayes method; latent Dirichlet allocation; online LDA; pattern discovery; text stream mining; topic detection; topic tracking; Application software; Benchmark testing; Computer science; Data mining; Linear discriminant analysis; Organizing; Pattern analysis; Software libraries; USA Councils; Yarn;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining, 2008. ICDM '08. Eighth IEEE International Conference on
  • Conference_Location
    Pisa
  • ISSN
    1550-4786
  • Print_ISBN
    978-0-7695-3502-9
  • Type

    conf

  • DOI
    10.1109/ICDM.2008.140
  • Filename
    4781095