• DocumentCode
    3678652
  • Title
    Most large topic models are approximately separable
  • Author
    Weicong Ding; Prakash Ishwar; Venkatesh Saligrama
  • Author_Institution
    Department of ECE, Boston University, MA 02215, USA
  • fYear
    2015
  • Firstpage
    199
  • Lastpage
    203
  • Abstract
    Separability has recently been leveraged as a key structural condition in topic models to develop asymptotically consistent algorithms with polynomial statistical and computational efficiency guarantees. Separability corresponds to the presence of at least one novel word for each topic. Empirical estimates of topic matrices for Latent Dirichlet Allocation models have been observed to be approximately separable. While separability is a convenient structural property, it appears, at first glance, to be too restrictive a condition. In this paper we explicitly demonstrate that separability is, in fact, an inevitable consequence of high dimensionality. In particular, we prove that when the columns of the topic matrix are independently sampled from a Dirichlet distribution, the resulting topic matrix will be approximately separable with probability tending to one as the number of rows (vocabulary size) grows sufficiently faster than the number of columns (topics). The proof combines concentration-of-measure results with properties of the Dirichlet distribution and union-bound arguments. Our proof techniques can be extended to other priors for general nonnegative matrices.
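
    The abstract's claim is easy to probe numerically. The sketch below is a minimal illustration, not the paper's method: it samples the columns of a W x K topic matrix i.i.d. from a symmetric Dirichlet prior and checks whether every topic acquires an approximately novel word. The novelty test (a row placing at least 1 - eps of its mass on one topic), the parameter values alpha and eps, and the name approx_separable_fraction are all illustrative assumptions. With K fixed, the success fraction should climb toward one as W grows, matching the stated asymptotic.

    import numpy as np

    rng = np.random.default_rng(0)

    def approx_separable_fraction(W, K, alpha=0.1, eps=0.05, trials=50):
        """Fraction of trials in which every topic has an eps-novel word.

        Here a row w counts as eps-novel for topic k if beta[w, k] holds at
        least (1 - eps) of row w's total mass across topics. This is one
        illustrative proxy for approximate separability, not the paper's
        exact definition.
        """
        successes = 0
        for _ in range(trials):
            # W x K topic matrix: each column is one Dirichlet(alpha,...,alpha)
            # draw over the W-word vocabulary.
            beta = rng.dirichlet(alpha * np.ones(W), size=K).T
            dominance = beta / beta.sum(axis=1, keepdims=True)
            # Topic k is covered if some word puts >= 1 - eps of its mass on k.
            covered = (dominance >= 1 - eps).any(axis=0)
            successes += covered.all()
        return successes / trials

    # The fraction of approximately separable draws should tend to 1 as the
    # vocabulary size W grows with the number of topics K held fixed.
    for W in (100, 1000, 10000):
        print(W, approx_separable_fraction(W, K=5))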
  • Keywords
    "Computational modeling","Semantics","Computers"
  • Publisher
    IEEE
  • Conference_Title
    Information Theory and Applications Workshop (ITA), 2015
  • Type
    conf
  • DOI
    10.1109/ITA.2015.7308989
  • Filename
    7308989