DocumentCode
2984038
Title
Adaptive Failure Detection via Heartbeat under Hadoop
Author
Zhu, Hao ; Chen, Haopeng
Author_Institution
Sch. of Software, Shanghai Jiao Tong Univ., Shanghai, China
fYear
2011
fDate
12-15 Dec. 2011
Firstpage
231
Lastpage
238
Abstract
Hadoop has become a popular framework for processing massive data sets on large-scale clusters. However, we observe that the detection of a failed worker is delayed, which can significantly increase the completion time of jobs with different workloads. To address this, we present two mechanisms, the Adaptive interval and the Reputation-based Detector, that help Hadoop detect a failed worker in the shortest time. The Adaptive interval dynamically configures the expiration time so that it adapts to the job size. The Reputation-based Detector evaluates the reputation of each worker; once a worker's reputation falls below a threshold, the worker is considered failed. Our experiments demonstrate that both strategies greatly improve the detection of failed workers. Specifically, the Adaptive interval performs relatively better with small jobs, while the Reputation-based Detector is more suitable for large jobs.
Keywords
distributed programming; software fault tolerance; Hadoop; adaptive failure detection; adaptive interval; failed worker; job size; large scale cluster; massive data set; reputation-based detector; Detectors; Educational institutions; Fault tolerance; Fault tolerant systems; Heart beat; Heart rate variability; Runtime; Cloud computing; Hadoop; MapReduce; adaptive heartbeat; failure detection;
fLanguage
English
Publisher
IEEE
Conference_Titel
Services Computing Conference (APSCC), 2011 IEEE Asia-Pacific
Conference_Location
Jeju Island
Print_ISBN
978-1-4673-0206-7
Type
conf
DOI
10.1109/APSCC.2011.46
Filename
6127967