DocumentCode
3147162
Title
An Algorithm-Based Recovery Scheme for Extreme Scale Computing
Author
Liu, Hui
Author_Institution
Dept. of Math. & Comput. Sci., Colorado Sch. of Mines, Golden, CO, USA
fYear
2011
fDate
16-20 May 2011
Firstpage
2010
Lastpage
2013
Abstract
We present an algorithm-based recovery scheme for exascale computing that uses both the data dependencies and the communication-induced redundancies of parallel codes to tolerate faults with low overhead. For some applications, our scheme significantly reduces checkpoint size and introduces no overhead when no failure actually occurs during the computation. We implement a fault-tolerant Newton's method by tailoring our scheme to the algorithm. Numerical simulations indicate that our scheme introduces much less overhead than diskless checkpointing does.
Keywords
Newton method; fault tolerant computing; numerical analysis; system recovery; algorithm based recovery scheme; communication induced redundancies; data dependencies; diskless checkpointing; exascale computing; extreme scale computing; fault tolerance Newton method; numerical simulations; parallel codes; Checkpointing; Computers; Fault tolerance; Fault tolerant systems; Jacobian matrices; Newton method; Program processors;
fLanguage
English
Publisher
IEEE
Conference_Titel
Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), 2011 IEEE International Symposium on
Conference_Location
Shanghai
ISSN
1530-2075
Print_ISBN
978-1-61284-425-1
Electronic_ISBN
1530-2075
Type
conf
DOI
10.1109/IPDPS.2011.363
Filename
6009077