DocumentCode :
3434182
Title :
Competitive K-Means, a New Accurate and Distributed K-Means Algorithm for Large Datasets
Author :
Esteves, R.M. ; Hacker, T. ; Chunming Rong
Author_Institution :
Dept. of Electr. & Comput. Eng., Univ. of Stavanger, Stavanger, Norway
Volume :
1
fYear :
2013
fDate :
2-5 Dec. 2013
Firstpage :
17
Lastpage :
24
Abstract :
The tremendous growth in data volumes has created a need for new tools and algorithms to quickly analyze large datasets. Cluster analysis techniques, such as K-means, can be used for large datasets distributed across several machines. The accuracy of K-means depends on the selection of seed centroids during initialization. K-means++ improves on the K-means seeder, but suffers from problems when it is applied to large datasets: (a) the random algorithm it employs can produce inconsistent results across several analysis runs under the same initial conditions; and (b) it scales poorly for large datasets. In this paper we describe a new Competitive K-means algorithm we developed that addresses both of these problems. We describe an efficient MapReduce implementation of our new Competitive K-means algorithm that we found scales well with large datasets. We compared the performance of our new algorithm with three existing cluster analysis algorithms and found that our new algorithm improves cluster analysis accuracy and decreases variance. Our results show that our new algorithm produced a speedup of 76 ± 9 times compared with the serial K-means++ and is as fast as the Streaming K-means. Our work provides a method to select a good initial seeding in less time, facilitating accurate cluster analysis over large datasets in a shorter time.
Keywords :
pattern clustering; very large databases; cluster analysis; cluster analysis techniques; competitive k-means algorithm; distributed k-means algorithm; large datasets; random algorithm; streaming k-means; Algorithm design and analysis; Approximation algorithms; Clustering algorithms; Computers; Google; Integrated circuits; Partitioning algorithms; K-means; K-means++; MapReduce; Streaming K-means
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Cloud Computing Technology and Science (CloudCom), 2013 IEEE 5th International Conference on
Conference_Location :
Bristol
Type :
conf
DOI :
10.1109/CloudCom.2013.89
Filename :
6753773