DocumentCode
3145951
Title
Algorithm-Based Recovery for Newton's Method without Checkpointing
Author
Liu, Hui ; Davies, Teresa ; Ding, Chong ; Karlsson, Christer ; Chen, Zizhong
Author_Institution
Dept. of Math. & Comput. Sci., Colorado Sch. of Mines, Golden, CO, USA
fYear
2011
fDate
16-20 May 2011
Firstpage
1541
Lastpage
1548
Abstract
Checkpointing is the most popular fault tolerance method used in high-performance computing (HPC) systems. However, increasing failure rates require more frequent checkpoints, which makes checkpointing more expensive. We present a checkpoint-free fault tolerance technique that takes advantage of both the data dependencies and the communication-induced redundancies of parallel applications to tolerate fail-stop failures. Under the specified conditions, our technique introduces no additional overhead when no failure occurs during the computation, and it recovers lost data with low overhead. We add fault tolerance to Newton's method using both our scheme and diskless checkpointing. Numerical simulations indicate that our scheme introduces much less overhead than diskless checkpointing.
Keywords
Newton method; fault tolerance; parallel processing; system recovery; HPC system; Newton method; algorithm-based recovery; checkpoint-free fault tolerance technique; communication-induced redundancy; data dependencies; diskless checkpointing; failure rate; high-performance computing; Checkpointing; Computers; Fault tolerant systems; Jacobian matrices; Nonlinear systems; Redundancy
fLanguage
English
Publisher
IEEE
Conference_Titel
Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), 2011 IEEE International Symposium on
Conference_Location
Shanghai
ISSN
1530-2075
Print_ISBN
978-1-61284-425-1
Electronic_ISBN
1530-2075
Type
conf
DOI
10.1109/IPDPS.2011.309
Filename
6009013