Title : 
Document clustering based on diffusion maps and a comparison of the k-means performances in various spaces
         
        
            Author : 
Allah, Fadoua Ataa ; Grosky, William I. ; Aboutajdine, Driss
         
        
            Author_Institution : 
GSCM-LRIT Lab., Mohamed V-Agdal Univ., Rabat
         
        
        
        
        
        
            Abstract : 
A major challenge in text mining arises from increasingly large text datasets and the high dimensionality associated with natural language. In this research, a systematic study of document clustering is conducted using the recently introduced diffusion framework and some characteristics of the singular value decomposition. The study is threefold. First, we propose constructing a diffusion kernel based on the cosine distance. Second, we compare the performance of the k-means algorithm in four different vector spaces: Salton space, latent semantic analysis (LSA) space, diffusion space based on the cosine distance, and diffusion space based on the Euclidean distance. Third, we undertake a statistical study of the k-means algorithm in the LSA space and in the diffusion space based on the cosine distance. In most of our experiments, k-means in the diffusion space based on the cosine distance performs better. In addition, the running time in this space is negligible compared to the time needed for k-means in Salton space.
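
The pipeline summarized in the abstract (a diffusion kernel built from cosine distances, followed by k-means in the resulting diffusion space) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the kernel formula, so the form exp(-d/epsilon), the parameter values, the farthest-first k-means initialization, the toy term counts, and all function names below are assumptions chosen for illustration only.

```python
import numpy as np


def cosine_distance_matrix(X):
    """Pairwise cosine distances 1 - cos(x_i, x_j) between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / np.maximum(norms, 1e-12)  # guard against all-zero rows
    return np.clip(1.0 - unit @ unit.T, 0.0, 2.0)


def diffusion_map(X, n_components=1, epsilon=0.5, t=1):
    """Embed the rows of X using a diffusion map on a cosine-distance kernel.

    The kernel form exp(-d / epsilon) and the defaults for epsilon (kernel
    width) and t (diffusion time) are assumptions, not values from the paper.
    """
    D = cosine_distance_matrix(X)
    K = np.exp(-D / epsilon)            # assumed kernel form
    d = K.sum(axis=1)                   # row sums (degrees)
    # Symmetric matrix sharing its eigenvalues with the Markov matrix
    # P = diag(d)^-1 K.
    S = K / np.sqrt(np.outer(d, d))
    vals, vecs = np.linalg.eigh(S)      # eigenvalues in ascending order
    order = np.argsort(vals)[::-1]      # reorder to descending
    vals, vecs = vals[order], vecs[:, order]
    psi = vecs / np.sqrt(d)[:, None]    # right eigenvectors of P
    # Drop the first (constant) eigenvector; scale the rest by eigenvalue^t.
    return psi[:, 1:n_components + 1] * vals[1:n_components + 1] ** t


def kmeans(Y, k, n_iter=100):
    """Plain Lloyd's k-means with a deterministic farthest-first start.

    The initialization is an assumption made so the sketch is reproducible.
    """
    centers = [Y[0]]
    for _ in range(1, k):
        d2 = np.min([((Y - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Y[np.argmax(d2)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            Y[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


# Toy document-term matrix (made-up counts): rows are documents, columns are
# terms. Documents 0-2 and 3-5 use disjoint vocabularies.
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [3, 1, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 1],
    [0, 0, 1, 3],
], dtype=float)

Y = diffusion_map(X, n_components=1)
labels = kmeans(Y, k=2)
print(labels)
```

In this sketch k-means runs on a single diffusion coordinate per document rather than on the full term vector. A much lower-dimensional clustering space of this kind is one plausible reading of the running-time advantage the abstract reports over Salton space.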
         
        
            Keywords : 
natural language processing; pattern clustering; singular value decomposition; text analysis; Euclidean distance; Salton space; cosine distance; diffusion maps; document clustering; k-means performances; natural language; singular value decomposition; text datasets; text mining; vector spaces; Algorithm design and analysis; Clustering algorithms; Computational Intelligence Society; Functional analysis; Kernel; Laboratories; Natural languages; Performance analysis; Singular value decomposition; Text mining
         
        
        
        
            Conference_Title : 
Computers and Communications, 2008. ISCC 2008. IEEE Symposium on
         
        
            Conference_Location : 
Marrakech
         
        
        
            Print_ISBN : 
978-1-4244-2702-4
         
        
            Electronic_ISBN : 
1530-1346
         
        
        
            DOI : 
10.1109/ISCC.2008.4625693