• DocumentCode
    1447088
  • Title
    Document Clustering in Correlation Similarity Measure Space
  • Author
    Zhang, Taiping ; Tang, Yuan Yan ; Fang, Bin ; Xiang, Yong
  • Author_Institution
    Dept. of Comput. Sci., Chongqing Univ., Chongqing, China
  • Volume
    24
  • Issue
    6
  • fYear
    2012
  • fDate
    6/1/2012 12:00:00 AM
  • Firstpage
    1002
  • Lastpage
    1013
  • Abstract
    This paper presents a new spectral clustering method called correlation preserving indexing (CPI), which is performed in the correlation similarity measure space. In this framework, the documents are projected into a low-dimensional semantic space in which the correlations between the documents in the local patches are maximized while the correlations between the documents outside these patches are minimized simultaneously. Since the intrinsic geometrical structure of the document space is often embedded in the similarities between the documents, correlation as a similarity measure is more suitable for detecting the intrinsic geometrical structure of the document space than Euclidean distance. Consequently, the proposed CPI method can effectively discover the intrinsic structures embedded in high-dimensional document space. The effectiveness of the new method is demonstrated by extensive experiments conducted on various data sets and by comparison with existing document clustering methods.
  • Keywords
    correlation methods; document handling; learning (artificial intelligence); pattern clustering; correlation preserving indexing; correlation similarity measure space; document clustering; document space; Euclidean distance; intrinsic geometrical structure; intrinsic structures; Clustering algorithms; Correlation; Euclidean distance; Indexing; Nearest neighbor searches; Semantics; Document clustering; correlation latent semantic indexing; correlation measure; dimensionality reduction.
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type
    jour
  • DOI
    10.1109/TKDE.2011.49
  • Filename
    5710934
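
The abstract's claim that correlation suits document similarity better than Euclidean distance can be illustrated with a minimal sketch. It assumes correlation is taken as the normalized inner product of two term vectors; the abstract does not state the exact definition, and the function names and toy vectors below are hypothetical, not taken from the paper. This is not the CPI method itself, only the similarity comparison that motivates it.

```python
import numpy as np

def correlation_similarity(u, v):
    """Normalized inner product of two term vectors (assumed definition)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return float(u @ v / denom)

def euclidean_distance(u, v):
    """Plain Euclidean distance between two term vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.linalg.norm(u - v))

# Toy term-frequency vectors (hypothetical data).
doc_a = np.array([2.0, 1.0, 0.0, 0.0])   # short document on topic 1
doc_b = 3.0 * doc_a                      # same term proportions, three times longer
doc_c = np.array([0.0, 0.0, 1.0, 2.0])   # short document on topic 2, no shared terms

corr_ab = correlation_similarity(doc_a, doc_b)
corr_ac = correlation_similarity(doc_a, doc_c)
dist_ab = euclidean_distance(doc_a, doc_b)
dist_ac = euclidean_distance(doc_a, doc_c)

print(f"correlation a-b: {corr_ab:.3f}, a-c: {corr_ac:.3f}")
print(f"euclidean   a-b: {dist_ab:.3f}, a-c: {dist_ac:.3f}")
```

In this toy case the correlation ranks doc_b as the closest match to doc_a, while Euclidean distance places doc_a nearer to the unrelated doc_c simply because the two are of similar length.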