DocumentCode
2843850
Title
Document clustering based on diffusion maps and a comparison of the k-means performances in various spaces
Author
Allah, Fadoua Ataa ; Grosky, William I. ; Aboutajdine, Driss
Author_Institution
GSCM-LRIT Lab., Mohamed V-Agdal Univ., Rabat
fYear
2008
fDate
6-9 July 2008
Firstpage
579
Lastpage
584
Abstract
A major challenge in text mining arises from increasingly large text datasets and the high dimensionality associated with natural language. This paper presents a systematic study of document clustering that uses the recently introduced diffusion framework and some characteristics of the singular value decomposition. The study is threefold. First, we propose constructing a diffusion kernel based on the cosine distance. Second, we compare the performance of the k-means algorithm in four different vector spaces: Salton space, latent semantic analysis (LSA) space, diffusion space based on the cosine distance, and diffusion space based on the Euclidean distance. Third, we undertake a statistical study of the k-means algorithm in the LSA space and in the diffusion space based on the cosine distance. In most of our experiments, k-means performs better in the diffusion space based on the cosine distance. In addition, the running time in this space is negligible compared to the time needed for k-means in Salton space.
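The pipeline summarized in the abstract (a diffusion kernel built on the cosine distance, followed by k-means in the resulting diffusion space) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the kernel form `exp(-d / epsilon)`, the parameter names `epsilon`, `n_components`, and `t`, the helper names, the toy matrix `X`, and the farthest-point k-means initialization are all assumptions made for the example. The abstract does not specify any of them.

```python
import numpy as np

def cosine_distance_matrix(X):
    """Pairwise cosine distance (1 - cosine similarity) between rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.maximum(norms, 1e-12)          # guard against empty documents
    return np.clip(1.0 - Xn @ Xn.T, 0.0, 2.0)

def diffusion_map(X, epsilon=0.5, n_components=2, t=1):
    """Embed documents with a diffusion map (assumed kernel: exp(-d_cos / epsilon))."""
    K = np.exp(-cosine_distance_matrix(X) / epsilon)
    d = K.sum(axis=1)                          # degree of each document
    # Symmetric matrix sharing its eigenvalues with the Markov matrix P = D^-1 K.
    S = K / np.sqrt(np.outer(d, d))
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1]             # sort eigenpairs, largest first
    vals, vecs = vals[order], vecs[:, order]
    psi = vecs / np.sqrt(d)[:, None]           # right eigenvectors of P
    # Drop the trivial first eigenvector (eigenvalue 1); scale by eigenvalue^t.
    return psi[:, 1:n_components + 1] * vals[1:n_components + 1] ** t

def kmeans(Y, k, n_iter=50):
    """Plain Lloyd's k-means with a deterministic farthest-point initialization."""
    centers = [Y[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((Y - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(Y[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = np.array([
            Y[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

# Hypothetical toy term-document weights: rows are documents, columns are terms.
# Documents 0-2 and documents 3-5 use disjoint vocabularies.
X = np.array([
    [2, 1, 0, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 2, 1, 0],
    [0, 0, 0, 1, 2, 1],
    [0, 0, 0, 1, 1, 1],
], dtype=float)

embedding = diffusion_map(X, epsilon=0.5, n_components=2)
labels = kmeans(embedding, k=2)
```

On this toy input, the two vocabulary groups should fall into separate clusters. Clustering in the low-dimensional `embedding` rather than in the full term space is also one plausible reading of why the abstract reports a much shorter k-means running time than in Salton space.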
Keywords
natural language processing; pattern clustering; singular value decomposition; text analysis; Euclidean distance; Salton space; cosine distance; diffusion maps; document clustering; k-means performances; natural language; text datasets; text mining; vector spaces; Algorithm design and analysis; Clustering algorithms; Computational Intelligence Society; Functional analysis; Kernel; Laboratories; Natural languages; Performance analysis; Singular value decomposition; Text mining
fLanguage
English
Publisher
IEEE
Conference_Titel
Computers and Communications, 2008. ISCC 2008. IEEE Symposium on
Conference_Location
Marrakech
ISSN
1530-1346
Print_ISBN
978-1-4244-2702-4
Electronic_ISBN
1530-1346
Type
conf
DOI
10.1109/ISCC.2008.4625693
Filename
4625693