DocumentCode :
249369
Title :
Federated MapReduce to Transparently Run Applications on Multicluster Environment
Author :
Chun-Yu Wang ; Tzu-Li Tai ; Shu Jui-Shing ; Jyh-Biau Chang ; Ce-Kuen Shieh
Author_Institution :
Dept. of Electr. Eng., Nat. Cheng Kung Univ., Tainan, Taiwan
fYear :
2014
fDate :
June 27 2014-July 2 2014
Firstpage :
296
Lastpage :
303
Abstract :
In the Cloud era, data is generated everywhere, and how to efficiently analyze "Big Data" characterized by large volume, fast generation, and variety is a critical issue. MapReduce is a simplified distributed parallel data processing model that has been widely applied in areas such as web indexing, clustering, and classification. However, sensitive data such as network logs or mail, distributed among independent organizations, must be kept private and cannot be aggregated for centralized analysis. We propose Federated MapReduce (Fed-MR), a framework for analyzing geographically distributed data among independent organizations while avoiding data movement. In contrast to previous work, Fed-MR retains the simplicity of MapReduce programming to provide a transparent way to run original MapReduce jobs across multiple clusters without any extra programming burden. Fed-MR also integrates multiple clusters in different locations to form hierarchical Top-Region relationships. Experiments showed that, compared to a single cluster with the same number of worker nodes, computation time increased by an average of only 30% in WordCount and 10% in Grep. Therefore, Fed-MR incurs reasonable performance overhead for analyzing data across Internet-connected clusters, while requiring no additional Global Reduce function as traditional hierarchical MapReduce frameworks do.
Keywords :
Big Data; cloud computing; distributed databases; parallel programming; Big Data; Fed-MR; Federated MapReduce; Grep; Internet-connected clusters; MapReduce jobs; MapReduce programming; WordCount; centralized analysis; cloud computing; computation time; data generation; data movement; data privacy; data variety; data volume; geometrically distributed parallel data processing model; hierarchical top-region relationships; mails; multicluster environment; network log; performance overheads; sensitive data; worker nodes; Cloud computing; Collaboration; Data models; Distributed databases; File systems; Programming; Hadoop; Hierarchical MapReduce; Multicluster;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Big Data (BigData Congress), 2014 IEEE International Congress on
Conference_Location :
Anchorage, AK
Print_ISBN :
978-1-4799-5056-0
Type :
conf
DOI :
10.1109/BigData.Congress.2014.50
Filename :
6906793