• DocumentCode
    3166310
  • Title

    Bayesian Folding-In with Dirichlet Kernels for PLSI

  • Author

    Hinneburg, Alexander ; Gabriel, Hans-Henning ; Gohr, André

  • Author_Institution
    Martin-Luther-Univ., Halle-Wittenberg
  • fYear
    2007
  • fDate
    28-31 Oct. 2007
  • Firstpage
    499
  • Lastpage
    504
  • Abstract
    Probabilistic latent semantic indexing (PLSI) represents the documents of a collection as mixture proportions of latent topics, which are learned from the collection by an expectation maximization (EM) algorithm. New documents or queries need to be folded into the latent topic space by a simplified version of the EM algorithm. During PLSI folding-in of a new document, the topic mixtures of the known documents are ignored. This may lead to a suboptimal model of the extended collection. Our new approach incorporates the topic mixtures of the known documents in a Bayesian way during folding-in. That knowledge is modeled as a prior distribution over the topic simplex using a kernel density estimate of Dirichlet kernels. We demonstrate the advantages of the new Bayesian folding-in using real text data.
  • Keywords
    Bayes methods; document handling; expectation-maximisation algorithm; indexing; probability; Bayesian folding-in; Dirichlet kernels; PLSI-folding-in; expectation maximization algorithm; known documents; latent topics; probabilistic latent semantic indexing; Bayesian methods; Biochemistry; Costs; Data mining; Graphical models; Indexing; Kernel; Linear discriminant analysis; Runtime; Text mining;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on
  • Conference_Location
    Omaha, NE
  • ISSN
    1550-4786
  • Print_ISBN
    978-0-7695-3018-5
  • Type

    conf

  • DOI
    10.1109/ICDM.2007.15
  • Filename
    4470280
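
The abstract describes standard PLSI folding-in as a simplified EM run for a new document. The following is a minimal sketch of that baseline only, assuming the common formulation in which the trained word distributions P(w|z) are held fixed and only the new document's topic mixture P(z|d) is re-estimated. The function name `fold_in`, its parameters, and the toy numbers are illustrative assumptions, not taken from the paper. The Bayesian variant with a Dirichlet kernel prior is not shown, since the record does not give its update equations.

```python
def fold_in(word_counts, p_w_given_z, n_iter=50):
    """Estimate the topic mixture P(z|d) of a new document.

    word_counts: dict mapping word id -> count in the new document.
    p_w_given_z: list of K lists; p_w_given_z[k][w] = P(w | z=k),
                 taken from an already trained model and kept fixed.
    """
    n_topics = len(p_w_given_z)
    p_z = [1.0 / n_topics] * n_topics  # start from a uniform mixture
    for _ in range(n_iter):
        new_p_z = [0.0] * n_topics
        for w, n in word_counts.items():
            # E-step: posterior over topics for this word
            joint = [p_w_given_z[k][w] * p_z[k] for k in range(n_topics)]
            norm = sum(joint)
            if norm == 0.0:
                continue  # word has zero probability under every topic
            # M-step accumulation: expected topic counts
            for k in range(n_topics):
                new_p_z[k] += n * joint[k] / norm
        total = sum(new_p_z)
        if total == 0.0:
            break  # nothing to update; keep the previous mixture
        p_z = [v / total for v in new_p_z]
    return p_z


# Toy model with two topics over a two-word vocabulary (made-up numbers).
topics = [[0.9, 0.1], [0.1, 0.9]]
mixture = fold_in({0: 10}, topics)
```

In this sketch the estimate depends only on the new document's own word counts, which is the limitation the abstract points out: the topic mixtures of the known documents play no role.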