DocumentCode :
3079289
Title :
A Multilevel Fault-Tolerance Technique for the DAG Data Driven Model
Author :
Hao Fu ; Ce Yu ; Jizhou Sun ; Jun Du ; Mengmeng Wang
Author_Institution :
Sch. of Comput. Sci. & Technol., Tianjin Univ., Tianjin, China
fYear :
2015
fDate :
4-7 May 2015
Firstpage :
1127
Lastpage :
1130
Abstract :
Tolerating hardware failures is a challenging task for parallel programming in massively parallel processing environments. However, traditional rollback-recovery techniques, which can be classified into checkpoint-based and log-based, introduce extra overhead for recording an overall snapshot of an application. For a specialized programming model, a private recovery technique is valuable and can achieve better performance. In this paper, a multilevel fault-tolerance technique designed for the DAG data driven model is proposed. It utilizes the checkpoint-based fault-tolerance technique for system recovery, and timeouts to detect and recover from performance faults. It consists of two kinds of checkpoints: the DAG pattern checkpoint and the intermediate result checkpoint. The DAG pattern checkpoint is designed for tracing the current processing progress of the DAG model, while the intermediate result checkpoint is used to record the outputs of compute nodes. Moreover, we also implement this technique in the EasyHPS runtime system. Experimental results show that the checkpointing overhead is as low as 2.6%.
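The two checkpoint levels and the timeout mechanism named in the abstract can be illustrated with a minimal Python sketch. This is not the EasyHPS implementation, and nothing here comes from the paper beyond the general idea: the function name `run_dag`, the parameters `deps`, `tasks`, `ckpt_dir`, `timeout_s`, and `max_retries`, and the JSON file layout are all assumptions made for illustration. It assumes node outputs are JSON-serializable and that nodes run one at a time.

```python
import json
import os
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def _atomic_dump(obj, path):
    # Write to a temp file first so a crash never leaves a half-written checkpoint.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(obj, f)
    os.replace(tmp, path)


def run_dag(deps, tasks, ckpt_dir, timeout_s=5.0, max_retries=2):
    """Hypothetical driver. deps: {node: [predecessors]}; tasks: {node: f(inputs)}."""
    os.makedirs(ckpt_dir, exist_ok=True)
    pattern_path = os.path.join(ckpt_dir, "dag_pattern.json")

    # Level 1 (assumed layout): the DAG pattern checkpoint lists finished nodes.
    done = []
    if os.path.exists(pattern_path):
        with open(pattern_path) as f:
            done = json.load(f)

    # Level 2 (assumed layout): one intermediate result file per finished node.
    results = {}
    for node in done:
        with open(os.path.join(ckpt_dir, f"result_{node}.json")) as f:
            results[node] = json.load(f)

    while len(results) < len(deps):
        ready = [n for n in deps
                 if n not in results and all(p in results for p in deps[n])]
        if not ready:
            raise ValueError("dependency cycle or unknown predecessor")
        for node in ready:
            inputs = {p: results[p] for p in deps[node]}
            for _ in range(max_retries + 1):
                pool = ThreadPoolExecutor(max_workers=1)
                future = pool.submit(tasks[node], inputs)
                try:
                    value = future.result(timeout=timeout_s)
                    break
                except FutureTimeout:
                    # Performance fault: abandon this attempt and resubmit.
                    # Note: Python cannot kill the stalled thread itself.
                    pass
                finally:
                    pool.shutdown(wait=False)
            else:
                raise RuntimeError(f"node {node!r} timed out on every attempt")

            # Save the result before the pattern, so the pattern checkpoint
            # never refers to a result file that does not exist yet.
            _atomic_dump(value, os.path.join(ckpt_dir, f"result_{node}.json"))
            results[node] = value
            done.append(node)
            _atomic_dump(done, pattern_path)
    return results
```

Calling `run_dag` again with the same `ckpt_dir` after a crash skips every node already listed in the pattern checkpoint and reloads its saved output, so only unfinished nodes are recomputed.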
Keywords :
checkpointing; directed graphs; parallel programming; software fault tolerance; DAG data driven model; EasyHPS runtime system; checkpoint-based fault tolerance technique; directed acyclic graph; hardware failure; multilevel fault-tolerance technique; parallel processing environment; parallel programming; private recovery technique; rollback-recovery techniques; Checkpointing; Computational modeling; Data models; Dynamic programming; Fault tolerance; Fault tolerant systems; Program processors; DAG data driven model; Fault tolerance; Multilevel Fault-tolerance Technique;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM International Symposium on
Conference_Location :
Shenzhen
Type :
conf
DOI :
10.1109/CCGrid.2015.89
Filename :
7152603