Title :
An Algorithm-Based Recovery Scheme for Extreme Scale Computing
Author_Institution :
Dept. of Math. & Comput. Sci., Colorado Sch. of Mines, Golden, CO, USA
Abstract :
We present an algorithm-based recovery scheme for exascale computing that uses both data dependencies and communication-induced redundancies of parallel codes to tolerate faults with low overhead. For some applications, our scheme significantly reduces checkpoint size and introduces no overhead when no failure actually occurs during the computation. We obtain a fault-tolerant Newton's method by tailoring our scheme to the algorithm. Numerical simulations indicate that our scheme introduces much less overhead than diskless checkpointing.
Keywords :
Newton method; fault tolerant computing; numerical analysis; system recovery; algorithm based recovery scheme; communication induced redundancies; data dependencies; diskless checkpointing; exascale computing; extreme scale computing; fault tolerance Newton method; numerical simulations; parallel codes; Checkpointing; Computers; Fault tolerance; Fault tolerant systems; Jacobian matrices; Newton method; Program processors;
Conference_Title :
Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), 2011 IEEE International Symposium on
Conference_Location :
Shanghai
Print_ISBN :
978-1-61284-425-1
Electronic_ISBN :
1530-2075
DOI :
10.1109/IPDPS.2011.363