DocumentCode
3079289
Title
A Multilevel Fault-Tolerance Technique for the DAG Data Driven Model
Author
Hao Fu ; Ce Yu ; Jizhou Sun ; Jun Du ; Mengmeng Wang
Author_Institution
Sch. of Comput. Sci. & Technol., Tianjin Univ., Tianjin, China
fYear
2015
fDate
4-7 May 2015
Firstpage
1127
Lastpage
1130
Abstract
Tolerating hardware failures is a challenging task for parallel programming in massively parallel processing environments. However, traditional rollback-recovery techniques, which can be classified as checkpoint-based or log-based, introduce extra overhead for recording an overall snapshot of an application. For a specialized programming model, a private recovery technique is valuable and can achieve better performance. In this paper, a multilevel fault-tolerance technique designed for the DAG data driven model is proposed. It utilizes the checkpoint-based fault tolerance technique for system recovery, and timeouts to detect and recover from performance faults. It consists of two kinds of checkpoints: the DAG pattern checkpoint and the intermediate result checkpoint. The DAG pattern checkpoint is designed for tracing the current processing progress of the DAG model, while the intermediate result checkpoint is used to record the outputs of compute nodes. Moreover, we also implement this technique in the EasyHPS runtime system. Experimental results show that the checkpointing overhead is as low as 2.6%.
Keywords
checkpointing; directed graphs; parallel programming; software fault tolerance; DAG data driven model; EasyHPS runtime system; checkpoint-based fault tolerance technique; directed acyclic graph; hardware failure; multilevel fault-tolerance technique; parallel processing environment; parallel programming; private recovery technique; rollback-recovery techniques; Checkpointing; Computational modeling; Data models; Dynamic programming; Fault tolerance; Fault tolerant systems; Program processors; DAG data driven model; Fault tolerance; Multilevel Fault-tolerance Technique;
fLanguage
English
Publisher
ieee
Conference_Titel
Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM International Symposium on
Conference_Location
Shenzhen
Type
conf
DOI
10.1109/CCGrid.2015.89
Filename
7152603