• DocumentCode
    3147162
  • Title
    An Algorithm-Based Recovery Scheme for Extreme Scale Computing
  • Author
    Liu, Hui
  • Author_Institution
    Dept. of Math. & Comput. Sci., Colorado Sch. of Mines, Golden, CO, USA
  • fYear
    2011
  • fDate
    16-20 May 2011
  • Firstpage
    2010
  • Lastpage
    2013
  • Abstract
    We present an algorithm-based recovery scheme for exascale computing that uses both data dependencies and communication-induced redundancies of parallel codes to tolerate faults with low overhead. For some applications, our scheme significantly reduces checkpoint size and introduces no overhead when no failure actually occurs during the computation. A fault-tolerant Newton's method is obtained by tailoring our scheme to the algorithm. Numerical simulations indicate that our scheme introduces much less overhead than diskless checkpointing does.
  • Keywords
    Newton method; fault tolerant computing; numerical analysis; system recovery; algorithm based recovery scheme; communication induced redundancies; data dependencies; diskless checkpointing; exascale computing; extreme scale computing; fault tolerance Newton method; numerical simulations; parallel codes; Checkpointing; Computers; Fault tolerance; Fault tolerant systems; Jacobian matrices; Newton method; Program processors
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), 2011 IEEE International Symposium on
  • Conference_Location
    Shanghai
  • ISSN
    1530-2075
  • Print_ISBN
    978-1-61284-425-1
  • Electronic_ISBN
    1530-2075
  • Type
    conf
  • DOI
    10.1109/IPDPS.2011.363
  • Filename
    6009077