DocumentCode :
189247
Title :
Multiple Parallel MapReduce k-Means Clustering with Validation and Selection
Author :
Dearo Garcia, Kemilly ; Coelho Naldi, Murilo
Author_Institution :
Dept. of Exact & Technol. Sci., Univ. Fed. de Viçosa - UFV, Rio Paranaíba, Brazil
fYear :
2014
fDate :
18-22 Oct. 2014
Firstpage :
432
Lastpage :
437
Abstract :
Handling large amounts of data is one of the main challenges for clustering, since huge data sets must be distributed and managed across separate repositories. New distributed systems have been designed to scale from a single server to thousands of machines. The MapReduce framework allows a job to be divided and its results combined seamlessly. k-means is one of the few clustering algorithms that satisfy the MapReduce constraints, but it requires the number of clusters to be specified in advance and is sensitive to their initialization. In this work, we propose a MapReduce clustering algorithm that executes multiple parallel runs of k-means with different initializations and numbers of clusters. Additionally, a MapReduce version of a relative cluster validity index is implemented and used to select the best result. The proposed algorithm is experimentally compared with the Apache Mahout Project's MapReduce implementation of k-means. Statistical tests applied to the results indicate that the proposed algorithm can outperform Mahout's implementation when multiple k-means partitions are required.
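The idea in the abstract (several k-means runs updated together, then ranked by a relative validity index) can be illustrated with a small in-memory sketch. This is not the authors' implementation and it does not use Hadoop or Mahout. All function names are hypothetical. The abstract does not name the validity index, so the simplified silhouette is used here only as an assumed stand-in.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to point (ties go to the lowest index)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def map_assign(point, runs):
    """Map step: emit one ((run_id, cluster_id), (point, 1)) pair per run."""
    for run_id, centroids in runs.items():
        yield (run_id, nearest(point, centroids)), (point, 1)


def reduce_update(pairs, runs):
    """Reduce step: average the points collected under each (run, cluster) key."""
    sums = {}
    for key, (point, count) in pairs:
        acc, n = sums.get(key, ([0.0] * len(point), 0))
        sums[key] = ([a + x for a, x in zip(acc, point)], n + count)
    # Copy the old centroids so that empty clusters keep their position.
    new_runs = {run_id: list(cs) for run_id, cs in runs.items()}
    for (run_id, cluster_id), (acc, n) in sums.items():
        new_runs[run_id][cluster_id] = tuple(a / n for a in acc)
    return new_runs


def simplified_silhouette(points, centroids):
    """Assumed index: mean of (b - a) / b, where a <= b are the two smallest
    point-to-centroid distances. Requires at least two centroids."""
    total = 0.0
    for p in points:
        a, b = sorted(math.dist(p, c) for c in centroids)[:2]
        total += (b - a) / b if b > 0 else 0.0
    return total / len(points)


def run_and_select(points, runs, iterations=10):
    """Update every run in one pass per iteration, then pick the best-scoring run."""
    for _ in range(iterations):
        pairs = [kv for p in points for kv in map_assign(p, runs)]
        runs = reduce_update(pairs, runs)
    scores = {rid: simplified_silhouette(points, cs) for rid, cs in runs.items()}
    best = max(scores, key=scores.get)
    return best, runs[best], scores


# Toy data: two well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
# Run ids are (k, init_index); initial centroids are chosen by hand here.
runs = {
    (2, 0): [(0, 0), (0, 1)],
    (3, 0): [(0, 0), (0, 1), (10, 10)],
}
best, centroids, scores = run_and_select(points, runs)
print(best, centroids, scores[best])
```

In practice the initial centroids would be drawn at random for each value of k (for example with `random.sample(points, k)`), and each map call would run on a separate data partition. The sketch only shows that all runs can share a single pass over the data per iteration.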
Keywords :
data handling; parallel programming; pattern clustering; statistical testing; Apache Mahout Project MapReduce implementation; MapReduce clustering algorithm; MapReduce constraint; cluster initialization; cluster number; cluster relative validity index; data repositories; data selection; data set distribution; data set management; data validation; distributed systems; multiple k-means partitioning; multiple parallel MapReduce k-mean clustering; parallel k-means runs; statistical tests; Big data; Clustering algorithms; Data structures; Indexes; Parallel processing; Partitioning algorithms; Vectors; Cluster Selection; Clustering Validation; MapReduce Clustering; k-means;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Intelligent Systems (BRACIS), 2014 Brazilian Conference on
Conference_Location :
São Paulo
Type :
conf
DOI :
10.1109/BRACIS.2014.83
Filename :
6984869