• DocumentCode
    1205938
  • Title
    Document clustering using locality preserving indexing
  • Author
    Cai, Deng; He, Xiaofei; Han, Jiawei
  • Author_Institution
    Dept. of Comput. Sci., Illinois Univ., Urbana, IL, USA
  • Volume
    17
  • Issue
    12
  • fYear
    2005
  • Firstpage
    1624
  • Lastpage
    1637
  • Abstract
    We propose a novel document clustering method which aims to cluster the documents into different semantic classes. The document space is generally of high dimensionality and clustering in such a high dimensional space is often infeasible due to the curse of dimensionality. By using locality preserving indexing (LPI), the documents can be projected into a lower-dimensional semantic space in which the documents related to the same semantics are close to each other. Different from previous document clustering methods based on latent semantic indexing (LSI) or nonnegative matrix factorization (NMF), our method tries to discover both the geometric and discriminating structures of the document space. Theoretical analysis of our method shows that LPI is an unsupervised approximation of the supervised linear discriminant analysis (LDA) method, which gives the intuitive motivation of our method. Extensive experimental evaluations are performed on the Reuters-21578 and TDT2 data sets.
  • Keywords
    data mining; database indexing; document handling; pattern clustering; Reuters-21578 data set; TDT2 data set; discriminating structures; document clustering method; geometric structures; locality preserving indexing; lower-dimensional semantic space; supervised linear discriminant analysis; Clustering algorithms; Clustering methods; Geometry; Helium; Indexing; Laplace equations; Large scale integration; Linear discriminant analysis; Performance evaluation; Stochastic processes; document clustering; dimensionality reduction; semantics
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type
    jour
  • DOI
    10.1109/TKDE.2005.198
  • Filename
    1524963