مرکز منطقه ای اطلاع رساني علوم و فناوري - Efficient <inline-formula> <img src="/images/tex/348.gif" alt="k"> <alternatives> <inline-graphic xlink:type="simple" xlink:href="qu-ieq1-2306193.gif"/></alternatives></inline-formula>-Means++ Approximation with MapReduce

DocumentCode :

1765746

Title :

Efficient $k$

-Means++ Approximation with MapReduce

Author :

Yujie Xu ; Wenyu Qu ; Zhiyang Li ; Geyong Min ; Keqiu Li ; Zhaobin Liu

Author_Institution :

Sch. of Inf. Sci. & Technol., Dalian Maritime Univ., Dalian, China

Volume :

Issue :

fYear :

2014

fDate :

Dec. 1 2014

Firstpage :

3135

Lastpage :

3144

Abstract :

k-means is undoubtedly one of the most popular clustering algorithms owing to its simplicity and efficiency. However, this algorithm is highly sensitive to the chosen initial centers and thus a proper initialization is crucial for obtaining an ideal solution. To address this problem, k-means++ is proposed to sequentially choose the centers so as to achieve a solution that is provably close to the optimal one. However, due to its weak scalability, k-means++ becomes inefficient as the size of data increases. To improve its scalability and efficiency, this paper presents Map Reduce k-means++ method which can drastically reduce the number of Map Reduce jobs by using only one MapReduce job to obtain k centers. The k-means++ initialization algorithm is executed in the Mapper phase and the weighted k-means++ initialization algorithm is run in the Reducer phase. As this new Map Reduce k-means++ method replaces the iterations among multiple machines with a single machine, it can reduce the communication and I/O costs significantly. We also prove that the proposed Map Reduce k-means++ method obtains O(α ²)approximation to the optimal solution of k-means. To reduce the expensive distance computation of the proposed method, we further propose a pruning strategy that can greatly avoid a large number of redundant distance computations. Extensive experiments on real and synthetic data are conducted and the performance results indicate that the proposed Map Reduce k-means++ method is much more efficient and can achieve a good approximation.

Keywords :

approximation theory; pattern clustering; I/O cost reduction; MapReduce k-means++ method; clustering algorithms; communication cost reduction; k-means++ approximation; mapper phase; pruning strategy; reducer phase; redundant distance computations; weighted k-means++ initialization algorithm; Algorithm design and analysis; Approximation algorithms; Approximation methods; Clustering algorithms; Educational institutions; Scalability; Standards; Clustering algorithms; MapReduce; approximation; k-means; k-means++; scalability;

fLanguage :

English

Journal_Title :

Parallel and Distributed Systems, IEEE Transactions on

Publisher :

ieee

ISSN :

1045-9219

Type :

jour

DOI :

10.1109/TPDS.2014.2306193

Filename :

6740074

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1765746