• DocumentCode
    3428271
  • Title
    Alleviating scalability issues of checkpointing protocols
  • Author
    Riesen, R.; Ferreira, K.; da Silva, Dilma; Lemarinier, Pierre; Arnold, Dorian; Bridges, Patrick G.
  • Author_Institution
    IBM Res., Dublin, Ireland
  • fYear
    2012
  • fDate
    10-16 Nov. 2012
  • Firstpage
    1
  • Lastpage
    11
  • Abstract
    Current fault tolerance protocols are not sufficiently scalable for the exascale era. The most widely used method, coordinated checkpointing, places enormous demands on the I/O subsystem and imposes frequent synchronizations. Uncoordinated protocols use message logging, which introduces message rate limitations or undesired memory and storage requirements to hold payload and event logs. In this paper, we propose a combination of several techniques, namely coordinated checkpointing, optimistic message logging, and a protocol that glues them together. This combination eliminates some of the drawbacks of each individual approach and proves to be an alternative for many types of exascale applications. We evaluate performance and scaling characteristics of this combination using simulation and a partial implementation. While not a universal solution, the combined protocol is suitable for a large range of existing and future applications that use coordinated checkpointing and enhances their scalability.
  • Keywords
    checkpointing; digital simulation; input-output programs; parallel processing; protocols; software performance evaluation; storage management; system monitoring; I/O subsystem; coordinated checkpointing protocol; exascale applications; optimistic message logging; partial implementation; performance evaluation; scalability; scaling characteristics; simulation; Checkpointing; Computational modeling; Economic indicators; Fault tolerance; Payloads; Protocols; Radiation detectors
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for
  • Conference_Location
    Salt Lake City, UT
  • ISSN
    2167-4329
  • Print_ISBN
    978-1-4673-0805-2
  • Type
    conf
  • DOI
    10.1109/SC.2012.18
  • Filename
    6468460