DocumentCode :
1629181
Title :
WordNet-Based and N-Grams-Based Document Clustering: A Comparative Study
Author :
Amine, Abdelmalek ; Elberrichi, Zakaria ; Simonet, Michel ; Malki, Mimoun
Author_Institution :
Dept. of Comput. Sci., UDL Univ., Sidi Belabbes
fYear :
2008
Firstpage :
394
Lastpage :
401
Abstract :
Many unsupervised classification (clustering) methods have been applied to textual documents. In this paper, we first propose using Kohonen self-organizing maps to cluster textual documents represented by n-grams. We then study the same method with WordNet synsets as the terms representing the documents. The effects of these methods are examined in several experiments using four similarity measures: the cosine distance, the Euclidean distance, the squared Euclidean distance and the Manhattan distance. The Reuters-21578 corpus is used for evaluation, which is carried out with the F-measure and entropy. The results show that, although the n-gram method performs well, adding lexical knowledge to the representation yields a better classification.
Keywords :
document handling; pattern classification; pattern clustering; self-organising feature maps; Cosine distance; Euclidean distance; F-measure; Kohonen self-organizing maps; Manhattan distance; WordNet; data representation; document clustering; entropy; lexical knowledge; n-grams representation; reuters-21578 corpus; squared Euclidean distance; unsupervised classifications; Application software; Biomedical measurements; Broadband communication; Computer science; Entropy; Euclidean distance; Information technology; Internet; Laboratories; Self organizing feature maps; Document clustering; WordNet; n-grams; reuters-21578; self-organizing maps of Kohonen; similarity;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Broadband Communications, Information Technology & Biomedical Applications, 2008 Third International Conference on
Conference_Location :
Gauteng
Print_ISBN :
978-1-4244-3281-3
Electronic_ISBN :
978-0-7695-3453-4
Type :
conf
DOI :
10.1109/BROADCOM.2008.7
Filename :
4696139