Title :
Efficient
-Means++ Approximation with MapReduce
Author :
Yujie Xu ; Wenyu Qu ; Zhiyang Li ; Geyong Min ; Keqiu Li ; Zhaobin Liu
Author_Institution :
Sch. of Inf. Sci. & Technol., Dalian Maritime Univ., Dalian, China
Abstract :
k-means is undoubtedly one of the most popular clustering algorithms owing to its simplicity and efficiency. However, this algorithm is highly sensitive to the chosen initial centers and thus a proper initialization is crucial for obtaining an ideal solution. To address this problem, k-means++ is proposed to sequentially choose the centers so as to achieve a solution that is provably close to the optimal one. However, due to its weak scalability, k-means++ becomes inefficient as the size of data increases. To improve its scalability and efficiency, this paper presents Map Reduce k-means++ method which can drastically reduce the number of Map Reduce jobs by using only one MapReduce job to obtain k centers. The k-means++ initialization algorithm is executed in the Mapper phase and the weighted k-means++ initialization algorithm is run in the Reducer phase. As this new Map Reduce k-means++ method replaces the iterations among multiple machines with a single machine, it can reduce the communication and I/O costs significantly. We also prove that the proposed Map Reduce k-means++ method obtains O(α 2)approximation to the optimal solution of k-means. To reduce the expensive distance computation of the proposed method, we further propose a pruning strategy that can greatly avoid a large number of redundant distance computations. Extensive experiments on real and synthetic data are conducted and the performance results indicate that the proposed Map Reduce k-means++ method is much more efficient and can achieve a good approximation.
Keywords :
approximation theory; pattern clustering; I/O cost reduction; MapReduce k-means++ method; clustering algorithms; communication cost reduction; k-means++ approximation; mapper phase; pruning strategy; reducer phase; redundant distance computations; weighted k-means++ initialization algorithm; Algorithm design and analysis; Approximation algorithms; Approximation methods; Clustering algorithms; Educational institutions; Scalability; Standards; Clustering algorithms; MapReduce; approximation; k-means; k-means++; scalability;
Journal_Title :
Parallel and Distributed Systems, IEEE Transactions on
DOI :
10.1109/TPDS.2014.2306193