• DocumentCode
    52610
  • Title
    BTM: Topic Modeling over Short Texts
  • Author
    Xueqi Cheng ; Xiaohui Yan ; Yanyan Lan ; Jiafeng Guo
  • Author_Institution
    Inst. of Comput. Technol., Beijing, China
  • Volume
    26
  • Issue
    12
  • fYear
    2014
  • fDate
    Dec. 1 2014
  • Firstpage
    2928
  • Lastpage
    2941
  • Abstract
    Short texts are popular on today's web, especially with the emergence of social media. Inferring topics from large-scale short text collections has become a critical but challenging problem for many content analysis tasks. Conventional topic models such as latent Dirichlet allocation (LDA) and probabilistic latent semantic analysis (PLSA) learn topics from document-level word co-occurrences by modeling each document as a mixture of topics, so their inference suffers from the sparsity of word co-occurrence patterns in short texts. In this paper, we propose a novel approach to short text topic modeling, referred to as the biterm topic model (BTM). BTM learns topics by directly modeling the generation of word co-occurrence patterns (i.e., biterms) in the corpus, making the inference effective with the rich corpus-level information. To cope with large-scale short text data, we further introduce two online algorithms for BTM for efficient topic learning. Experiments on real-world short text collections show that BTM can discover more prominent and coherent topics, and significantly outperforms the state-of-the-art baselines. We also demonstrate the appealing performance of the two online BTM algorithms in both time efficiency and topic learning.
  • Keywords
    content management; inference mechanisms; social networking (online); text analysis; word processing; biterm topic model; content analysis; corpus level information; inference mechanism; large scale short text data collection; online BTM algorithms; short text topic modeling; social media; time efficiency; topic learning; word co-occurrence patterns; Algorithm design and analysis; Analytical models; Context modeling; Data models; Inference algorithms; Semantics; Time complexity; Short text; biterm; content analysis; online algorithm; topic model;
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type
    jour
  • DOI
    10.1109/TKDE.2014.2313872
  • Filename
    6778764
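
The abstract describes BTM as modeling corpus-level word co-occurrence patterns (biterms) rather than per-document word counts. The following is a minimal sketch of how such biterms could be collected, not the authors' implementation. It assumes that a biterm is an unordered pair of words appearing in the same short text, that the whole text serves as the co-occurrence window, and that whitespace tokenization is adequate. The function names `extract_biterms` and `corpus_biterm_counts` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered word pair from one tokenized short text.

    Assumption: the entire short text is the co-occurrence window.
    """
    # Sort each pair so (a, b) and (b, a) count as the same biterm.
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def corpus_biterm_counts(docs):
    """Aggregate biterm counts over all documents in the corpus."""
    counts = Counter()
    for doc in docs:
        # Assumption: lowercase + whitespace split is enough tokenization.
        counts.update(extract_biterms(doc.lower().split()))
    return counts


# Two toy short texts (illustrative data, not from the paper).
docs = ["visit apple store", "apple store opening"]
counts = corpus_biterm_counts(docs)
```

Pooling the pairs across documents is what gives the model corpus-level statistics: each three-word text here yields only three biterms, yet the pair ("apple", "store") is observed twice once the counts are aggregated.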