Improving MapReduce fault tolerance in the cloud

Author

Zheng, Qin

Author_Institution

Adv. Comput. Programme, Inst. of High Performance Comput., Singapore, Singapore

fYear

2010

fDate

19-23 April 2010

Firstpage

1

Lastpage

6

Abstract

MapReduce has been used at Google, Yahoo, Facebook, and other companies, even for their production jobs. However, according to a recent study, a single failure in a Hadoop job could cause a 50% increase in completion time. Amazon Elastic MapReduce has been provided to help users perform data-intensive tasks for their applications. These applications may have high fault tolerance and/or tight SLA requirements. However, MapReduce fault tolerance in the cloud is more challenging because topology control and (data) rack locality are currently not possible. In this paper, we investigate how redundant copies can be provisioned for tasks to improve MapReduce fault tolerance in the cloud while reducing latency.

Keywords

Internet; Web sites; fault tolerant computing; Amazon Elastic MapReduce; Hadoop job; MapReduce fault tolerance; SLA requirement; Availability; Cloud computing; Delay; Disk drives; Facebook; Fault tolerance; High performance computing; Job production systems; Open source software; Topology; MapReduce; backup; fault tolerance; scheduling;

fLanguage

English

Publisher

IEEE

Conference_Titel

Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on

Conference_Location

Atlanta, GA

Print_ISBN

978-1-4244-6533-0

Type

conf

DOI

10.1109/IPDPSW.2010.5470865

Filename

5470865