Federated MapReduce to Transparently Run Applications on Multicluster Environment

Author

Chun-Yu Wang ; Tzu-Li Tai ; Shu Jui-Shing ; Jyh-Biau Chang ; Ce-Kuen Shieh

Author_Institution

Dept. of Electr. Eng., Nat. Cheng Kung Univ., Tainan, Taiwan

fYear

2014

fDate

June 27 2014-July 2 2014

Firstpage

296

Lastpage

303

Abstract

In the Cloud era, data is generated everywhere, and how to efficiently analyze such "Big Data", characterized by large volume, fast generation, and variety, is a critical issue. MapReduce is a simplified distributed parallel data processing model that has been widely applied in areas such as web indexing, clustering, and classification. However, sensitive data such as network logs or mail, when distributed among independent organizations, must be kept private and cannot be aggregated for centralized analysis. We propose Federated MapReduce (Fed-MR), a framework for analyzing geographically distributed data among independent organizations while avoiding data movement. In contrast to previous work, Fed-MR retains the simplicity of MapReduce programming to provide a transparent way to run original MapReduce jobs across multiple clusters without any extra programming burden. Fed-MR also integrates multiple clusters in different locations to form hierarchical Top-Region relationships. Experiments showed that, compared to a single cluster with the same number of worker nodes, computation time increased by an average of only 30% in WordCount and 10% in Grep. Fed-MR therefore incurs reasonable performance overhead when analyzing data across Internet-connected clusters, and it requires no additional Global Reduce function, unlike traditional hierarchical MapReduce frameworks.

Keywords

Big Data; cloud computing; distributed databases; parallel programming; Fed-MR; Federated MapReduce; Grep; Internet-connected clusters; MapReduce jobs; MapReduce programming; WordCount; centralized analysis; computation time; data generation; data movement; data privacy; data variety; data volume; geometrically distributed parallel data processing model; hierarchical top-region relationships; mails; multicluster environment; network log; performance overheads; sensitive data; worker nodes; Collaboration; Data models; File systems; Programming; Hadoop; Hierarchical MapReduce; Multicluster;

fLanguage

English

Publisher

IEEE

Conference_Titel

Big Data (BigData Congress), 2014 IEEE International Congress on

Conference_Location

Anchorage, AK

Print_ISBN

978-1-4799-5056-0

Type

conf

DOI

10.1109/BigData.Congress.2014.50

Filename

6906793