DocumentCode
2769955
Title
Document Clustering Using K-Means, Heuristic K-Means and Fuzzy C-Means
Author
Singh, Vivek Kumar ; Tiwari, Nisha ; Garg, Shekhar
Author_Institution
Dept. of Comput. Sci., South Asian Univ., New Delhi, India
fYear
2011
fDate
7-9 Oct. 2011
Firstpage
297
Lastpage
301
Abstract
Document clustering refers to the unsupervised classification (categorization) of documents into groups (clusters) such that documents within a cluster are similar, whereas documents in different clusters are dissimilar. The documents may be web pages, blog posts, news articles, or other text files. This paper presents our experimental work on applying the K-means, heuristic K-means and fuzzy C-means algorithms to the clustering of text documents. We experimented with different representations (tf, tf.idf and Boolean) and different feature selection schemes (with or without stop word removal, and with or without stemming). We ran our implementations on some standard datasets and computed various performance measures for these algorithms. The results indicate that the tf.idf representation and the use of stemming yield better clustering. Moreover, fuzzy clustering produces better results than both K-means and heuristic K-means on almost all datasets, and is a more stable method.
Keywords
Web sites; fuzzy set theory; pattern classification; pattern clustering; text analysis; Web pages; blog posts; document clustering; feature selection schemes; fuzzy c-means; fuzzy clustering; heuristic k-means; news articles; text files; tf.idf representation; unsupervised document classification; Algorithm design and analysis; Classification algorithms; Clustering algorithms; Frequency measurement; Heuristic algorithms; Partitioning algorithms; Vectors; Cluster Evaluation; Document Clustering; Fuzzy C-means; Heuristic K-means; K-means
fLanguage
English
Publisher
IEEE
Conference_Titel
Computational Intelligence and Communication Networks (CICN), 2011 International Conference on
Conference_Location
Gwalior
Print_ISBN
978-1-4577-2033-8
Type
conf
DOI
10.1109/CICN.2011.62
Filename
6112875