DocumentCode :
3559528
Title :
K-Means Clustering Versus Validation Measures: A Data-Distribution Perspective
Author :
Xiong, Hui ; Wu, Junjie ; Chen, Jian
Author_Institution :
Manage. Sci. & Inf. Syst. Dept., Rutgers Univ., Newark, NJ
Volume :
39
Issue :
2
fYear :
2009
fDate :
4/1/2009 12:00:00 AM
Firstpage :
318
Lastpage :
331
Abstract :
K-means is a well-known and widely used partitional clustering method. While there are considerable research efforts to characterize the key features of the K-means clustering algorithm, further investigation is needed to understand how data distributions can have impact on the performance of K-means clustering. To that end, in this paper, we provide a formal and organized study of the effect of skewed data distributions on K-means clustering. Along this line, we first formally illustrate that K-means tends to produce clusters of relatively uniform size, even if input data have varied "true" cluster sizes. In addition, we show that some clustering validation measures, such as the entropy measure, may not capture this uniform effect and provide misleading information on the clustering performance. Viewed in this light, we provide the coefficient of variation (CV) as a necessary criterion to validate the clustering results. Our findings reveal that K-means tends to produce clusters in which the variations of cluster sizes, as measured by CV, are in a range of about 0.3-1.0. Specifically, for data sets with large variation in "true" cluster sizes (e.g., CV > 1.0), K-means reduces variation in resultant cluster sizes to less than 1.0. In contrast, for data sets with small variation in "true" cluster sizes (e.g., CV < 0.3), K-means increases variation in resultant cluster sizes to greater than 0.3. In other words, for the earlier two cases, K-means produces the clustering results which are away from the "true" cluster distributions.
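The CV criterion mentioned in the abstract can be illustrated with a minimal sketch. It assumes CV is the sample standard deviation of the cluster sizes divided by their mean; the abstract does not spell out the formula, so that definition, the function name `cluster_size_cv`, and the toy label lists are assumptions made for illustration only.

```python
from collections import Counter
from statistics import mean, stdev


def cluster_size_cv(labels):
    """Coefficient of variation (sample std / mean) of cluster sizes.

    Assumption: the sample standard deviation (n - 1 denominator) is used;
    the abstract itself does not state which variant is intended.
    """
    sizes = list(Counter(labels).values())
    if len(sizes) < 2:
        raise ValueError("need at least two clusters to compute CV")
    return stdev(sizes) / mean(sizes)


# Hypothetical label assignments, not data from the paper.
skewed_labels = ["a"] * 90 + ["b"] * 5 + ["c"] * 5      # highly uneven sizes
uniform_labels = ["a"] * 34 + ["b"] * 33 + ["c"] * 33   # nearly even sizes

print(f"skewed CV:  {cluster_size_cv(skewed_labels):.3f}")
print(f"uniform CV: {cluster_size_cv(uniform_labels):.3f}")
```

Under these assumptions, the skewed labelling gives a CV above 1.0 and the nearly uniform labelling gives a CV below 0.3, which are the two regimes the abstract singles out. Comparing the CV of the "true" class sizes with the CV of the cluster sizes K-means returns would show how far the result has drifted toward uniform sizes.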
Keywords :
entropy; pattern clustering; data-distribution perspective; entropy measure; K-means clustering; clustering validation; F-measure; coefficient of variation (CV)
fLanguage :
English
Journal_Title :
Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on
Publisher :
IEEE
Conference_Location :
12/12/2008 12:00:00 AM
ISSN :
1083-4419
Type :
jour
DOI :
10.1109/TSMCB.2008.2004559
Filename :
4711107