DocumentCode :
2769955
Title :
Document Clustering Using K-Means, Heuristic K-Means and Fuzzy C-Means
Author :
Singh, Vivek Kumar ; Tiwari, Nisha ; Garg, Shekhar
Author_Institution :
Dept. of Comput. Sci., South Asian Univ., New Delhi, India
fYear :
2011
fDate :
7-9 Oct. 2011
Firstpage :
297
Lastpage :
301
Abstract :
Document clustering refers to the unsupervised classification (categorization) of documents into groups (clusters) such that documents within a cluster are similar, whereas documents in different clusters are dissimilar. The documents may be web pages, blog posts, news articles, or other text files. This paper presents our experimental work on applying the K-means, heuristic K-means, and fuzzy C-means algorithms to clustering text documents. We experimented with different representations (tf, tf.idf, and Boolean) and different feature selection schemes (with or without stop-word removal, and with or without stemming). We ran our implementations on several standard datasets and computed various performance measures for these algorithms. The results indicate that the tf.idf representation and the use of stemming yield better clustering. Moreover, fuzzy clustering produces better results than both K-means and heuristic K-means on almost all datasets, and is a more stable method.
Keywords :
Web sites; fuzzy set theory; pattern classification; pattern clustering; text analysis; Web pages; blog posts; document clustering; feature selection schemes; fuzzy c-means; fuzzy clustering; heuristic k-means; news articles; text files; tf.idf representation; unsupervised document classification; Algorithm design and analysis; Classification algorithms; Clustering algorithms; Frequency measurement; Heuristic algorithms; Partitioning algorithms; Vectors; Cluster Evaluation; Document Clustering; Fuzzy C-means; Heuristic K-means; K-means;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Computational Intelligence and Communication Networks (CICN), 2011 International Conference on
Conference_Location :
Gwalior
Print_ISBN :
978-1-4577-2033-8
Type :
conf
DOI :
10.1109/CICN.2011.62
Filename :
6112875