Title :
Cracking Down MapReduce Failure Amplification through Analytics Logging and Migration
Author :
Yandong Wang ; Huansong Fu ; Weikuan Yu
Author_Institution :
Dept. of Comput. Sci. & Software Eng., Auburn Univ., Auburn, AL, USA
Abstract :
MapReduce is popular for big data analytics because it offers easy-to-use map and reduce user interfaces while hiding the complexity of system scalability and fault resiliency. While a large body of literature has focused on improving the performance and scalability of MapReduce, fault resiliency has thus far received little attention. In this paper, we investigate the fault resiliency of MapReduce using YARN (the next-generation Hadoop) as a case study. We reveal that the failure of a MapTask, a ReduceTask, or a compute node can have distinctly different impacts on MapReduce programs. In particular, YARN MapReduce cannot gracefully handle failures that involve ReduceTasks, which cause prolonged task execution, delayed job completion, and, more severely, failure amplification due to cascading effects on other tasks. Together, these problems cause the performance of MapReduce jobs to collapse. We introduce a new fault-tolerant framework that cracks down on failure amplification and gracefully handles failure scenarios. It is designed around two key fault-handling techniques: analytics logging and speculative fast migration. Analytics logging is a lightweight mechanism that logs the key progress information of MapReduce tasks. Speculative fast migration handles node failures by proactively re-executing MapTasks, migrating ReduceTasks, and performing collective merging with a pipeline of shuffle/merge and reduce stages. Our performance evaluation demonstrates that these techniques can eliminate failure amplification and deliver faster job execution than the existing task re-execution mechanism in MapReduce.
Keywords :
Big Data; parallel processing; software fault tolerance; Big Data analytics; MapReduce failure amplification; MapTask; ReduceTask; YARN; analytics logging fault handling techniques; fault resiliency; fault-tolerant framework; next-generation Hadoop; node failures; performance evaluation; speculative fast migration; system scalability complexity; task re-execution mechanism; user interface reduction; Computer crashes; Delays; Merging; Power system faults; Scalability; Scheduling; Yarn; Failure Amplification; Fault Tolerance; MapReduce;
Conference_Title :
Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International
Conference_Location :
Hyderabad
DOI :
10.1109/IPDPS.2015.111