• DocumentCode
    3145951
  • Title

    Algorithm-Based Recovery for Newton's Method without Checkpointing

  • Author

    Liu, Hui ; Davies, Teresa ; Ding, Chong ; Karlsson, Christer ; Chen, Zizhong

  • Author_Institution
    Dept. of Math. & Comput. Sci., Colorado Sch. of Mines, Golden, CO, USA
  • fYear
    2011
  • fDate
    16-20 May 2011
  • Firstpage
    1541
  • Lastpage
    1548
  • Abstract
    Checkpointing is the most popular fault tolerance method used in high-performance computing (HPC) systems. However, increasing failure rates require more frequent checkpoints, which makes checkpointing more expensive. We present a checkpoint-free fault tolerance technique. It takes advantage of both data dependencies and communication-induced redundancies in parallel applications to tolerate fail-stop failures. Under the specified conditions, our technique introduces no additional overhead when no failure occurs during the computation, and it recovers lost data with low overhead. We add fault-tolerant capability to Newton's method using both our scheme and diskless checkpointing. Numerical simulations indicate that our scheme introduces much less overhead than diskless checkpointing does.
  • Keywords
    Newton method; fault tolerance; parallel processing; system recovery; HPC system; algorithm-based recovery; checkpoint-free fault tolerance technique; communication-induced redundancy; data dependencies; diskless checkpointing; failure rate; high-performance computing; Checkpointing; Computers; Fault tolerant systems; Jacobian matrices; Nonlinear systems; Redundancy
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), 2011 IEEE International Symposium on
  • Conference_Location
    Shanghai
  • ISSN
    1530-2075
  • Print_ISBN
    978-1-61284-425-1
  • Electronic_ISBN
    1530-2075
  • Type

    conf

  • DOI
    10.1109/IPDPS.2011.309
  • Filename
    6009013