  • DocumentCode
    2843850
  • Title
    Document clustering based on diffusion maps and a comparison of the k-means performances in various spaces
  • Author
    Allah, Fadoua Ataa ; Grosky, William I. ; Aboutajdine, Driss
  • Author_Institution
    GSCM-LRIT Lab., Mohamed V-Agdal Univ., Rabat
  • fYear
    2008
  • fDate
    6-9 July 2008
  • Firstpage
    579
  • Lastpage
    584
  • Abstract
    A major challenge in text mining arises from increasingly large text datasets and the high dimensionality associated with natural language. In this research, a systematic study is conducted in the context of document clustering, using the recently introduced diffusion framework and some characteristics of the singular value decomposition. This study is threefold. First, we propose to construct a diffusion kernel based on the cosine distance. Second, we compare the performance of the k-means algorithm in four different vector spaces: Salton space, latent semantic analysis (LSA) space, diffusion space based on the cosine distance, and diffusion space based on the Euclidean distance. Third, we undertake a statistical study of the k-means algorithm in the LSA space and the diffusion space based on the cosine distance. In most of our experiments, k-means in the diffusion space based on the cosine distance performs better. In addition, the running time in this space is negligible compared to the time needed for k-means in Salton space.
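The first contribution named in the abstract, a diffusion kernel built on the cosine distance, might be sketched roughly as follows. The abstract gives no formula, so the exponential kernel form, the `epsilon` scale parameter, the function names, and the toy vectors below are all assumptions made for illustration, not the authors' implementation.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(u, v); treated as 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def diffusion_kernel(docs, epsilon=1.0):
    """Assumed kernel form: K[i][j] = exp(-cosine_distance(d_i, d_j) / epsilon)."""
    return [[math.exp(-cosine_distance(u, v) / epsilon) for v in docs]
            for u in docs]


def markov_matrix(kernel):
    """Row-normalise the kernel so that each row sums to 1 (random-walk matrix)."""
    result = []
    for row in kernel:
        total = sum(row)
        result.append([k / total for k in row])
    return result


# Hypothetical term-frequency vectors: two similar documents and one unrelated.
docs = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 1, 3]]
K = diffusion_kernel(docs, epsilon=0.5)
P = markov_matrix(K)
```

In the usual diffusion-map construction, the leading eigenvectors of a matrix like `P` would then supply the low-dimensional coordinates in which k-means is run; that step is omitted here to keep the sketch dependency-free.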
  • Keywords
    natural language processing; pattern clustering; singular value decomposition; text analysis; Euclidean distance; Salton space; cosine distance; diffusion maps; document clustering; k-means performances; natural language; text datasets; text mining; vector spaces; Algorithm design and analysis; Clustering algorithms; Computational Intelligence Society; Functional analysis; Kernel; Laboratories; Natural languages; Performance analysis
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computers and Communications, 2008. ISCC 2008. IEEE Symposium on
  • Conference_Location
    Marrakech
  • ISSN
    1530-1346
  • Print_ISBN
    978-1-4244-2702-4
  • Electronic_ISBN
    1530-1346
  • Type
    conf
  • DOI
    10.1109/ISCC.2008.4625693
  • Filename
    4625693