DocumentCode
249369
Title
Federated MapReduce to Transparently Run Applications on Multicluster Environment
Author
Chun-Yu Wang ; Tzu-Li Tai ; Shu Jui-Shing ; Jyh-Biau Chang ; Ce-Kuen Shieh
Author_Institution
Dept. of Electr. Eng., Nat. Cheng Kung Univ., Tainan, Taiwan
fYear
2014
fDate
June 27 2014-July 2 2014
Firstpage
296
Lastpage
303
Abstract
In the Cloud era, data is generated everywhere, and efficiently analyzing "Big Data", characterized by large volume, fast generation, and variety, is a critical issue. MapReduce is a simplified distributed parallel data processing model that has been widely applied in areas such as web indexing, clustering, and classification. However, sensitive data such as network logs or mail is often distributed among independent organizations; such data must remain private and cannot be aggregated for centralized analysis. We propose Federated MapReduce (Fed-MR), a framework for analyzing geographically distributed data among independent organizations while avoiding data movement. In contrast to previous work, Fed-MR retains the simplicity of MapReduce programming to provide a transparent way to run original MapReduce jobs across multiple clusters without any extra programming burden. Fed-MR also integrates multiple clusters in different locations to form hierarchical Top-Region relationships. Experiments showed that, compared to a single cluster with the same number of worker nodes, computation time increased by an average of only 30% in WordCount and 10% in Grep. Fed-MR therefore incurs reasonable performance overhead when analyzing data across Internet-connected clusters, and it requires no additional Global Reduce function, unlike traditional hierarchical MapReduce frameworks.
Keywords
Big Data; cloud computing; distributed databases; parallel programming; Big Data; Fed-MR; Federated MapReduce; Grep; Internet-connected clusters; MapReduce jobs; MapReduce programming; WordCount; centralized analysis; cloud computing; computation time; data generation; data movement; data privacy; data variety; data volume; geometrically distributed parallel data processing model; hierarchical top-region relationships; mails; multicluster environment; network log; performance overheads; sensitive data; worker nodes; Cloud computing; Collaboration; Data models; Distributed databases; File systems; Programming; Hadoop; Hierarchical MapReduce; Multicluster;
fLanguage
English
Publisher
ieee
Conference_Titel
Big Data (BigData Congress), 2014 IEEE International Congress on
Conference_Location
Anchorage, AK
Print_ISBN
978-1-4799-5056-0
Type
conf
DOI
10.1109/BigData.Congress.2014.50
Filename
6906793