Title :
Identifying patterns towards Algorithm Based Fault Tolerance
Author :
Kabir, Upama ; Goswami, Dhrubajyoti
Author_Institution :
Dept. of Comput. Sci. & Software Eng., Concordia Univ., Montreal, QC, Canada
Abstract :
The checkpoint and recovery cost imposed by coordinated checkpoint/restart (CCP/R) is a crucial performance issue for high performance computing (HPC) applications. In comparison, Algorithm Based Fault Tolerance (ABFT) is a promising fault tolerance method with low recovery overhead, but it is not universally applicable and is not transparent to the user. In this paper, we address the overhead problem of CCP/R and some of the limitations of ABFT, and propose a solution for ABFT based on algorithmic patterns. The proposed solution is a generic fault tolerance strategy for a group of applications that exhibit similar algorithmic (structural and behavioral) features. These features, together with the minimal fault recovery data (critical data), determine the fault tolerance strategy for the group of applications. We call this strategy a fault tolerance pattern (FTP). We demonstrate the idea of FTP with parallel iterative deepening A* (PIDA*) search, a generic search algorithm used to solve a wide range of discrete optimization problems (DOPs). Theoretical analysis shows that our proposed solution performs better than CCP/R in terms of checkpoint and recovery time overhead. Furthermore, using FTP promotes separation of concerns, which facilitates user transparency.
Keywords :
checkpointing; fault tolerant computing; optimisation; parallel algorithms; search problems; ABFT; CCP/R; DOP; FTP; PIDA* search; algorithm based fault tolerance; coordinated checkpoint/restart; discrete optimization problems; fault tolerance pattern; generic search algorithm; parallel iterative deepening A* search; pattern identification; Algorithm design and analysis; Fault tolerance; Fault tolerant systems; Kernel; Program processors; Protocols; fault tolerant parallel programs; framework for fault tolerance; parallel algorithmic patterns; patterns for fault tolerance
Conference_Title :
High Performance Computing & Simulation (HPCS), 2015 International Conference on
Conference_Location :
Amsterdam
Print_ISBN :
978-1-4673-7812-3
DOI :
10.1109/HPCSim.2015.7237083