• DocumentCode
    2207486
  • Title

    Sequential Latent Dirichlet Allocation: Discover Underlying Topic Structures within a Document

  • Author

    Du, Lan ; Buntine, Wray ; Jin, Huidong

  • Author_Institution
    CECS, Australian Nat. Univ., Canberra, ACT, Australia
  • fYear
    2010
  • fDate
    13-17 Dec. 2010
  • Firstpage
    148
  • Lastpage
    157
  • Abstract
    Understanding how topics within a document evolve over its structure is an interesting and important problem. In this paper, we address this problem by presenting a novel variant of Latent Dirichlet Allocation (LDA): Sequential LDA (SeqLDA). This variant directly considers the underlying sequential structure, i.e., a document consists of multiple segments (e.g., chapters, paragraphs), each of which is correlated with its previous and subsequent segments. In our model, a document and its segments are modelled as random mixtures of the same set of latent topics, each of which is a distribution over words. The topic distribution of each segment depends on that of its previous segment, and the topic distribution of the first segment depends on the document topic distribution. This progressive dependency is captured by using the nested two-parameter Poisson-Dirichlet process (PDP). We develop an efficient collapsed Gibbs sampling algorithm to sample from the posterior of the PDP. Our experimental results on patent documents show that, by taking into account the sequential structure within a document, our SeqLDA model has a higher fidelity than LDA in terms of perplexity (a standard measure of dictionary-based compressibility). The SeqLDA model also yields a nicer sequential topic structure than LDA, as we show in experiments on books such as Melville's "The Whale".
  • Keywords
    data mining; sampling methods; stochastic processes; word processing; Gibbs sampling; Poisson Dirichlet process; document structure; sequential Latent Dirichlet allocation; topic distribution; topic structure; Latent Dirichlet Allocation; Poisson-Dirichlet process; collapsed Gibbs sampler
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining (ICDM), 2010 IEEE 10th International Conference on
  • Conference_Location
    Sydney, NSW
  • ISSN
    1550-4786
  • Print_ISBN
    978-1-4244-9131-5
  • Electronic_ISBN
    1550-4786
  • Type

    conf

  • DOI
    10.1109/ICDM.2010.51
  • Filename
    5693968