• DocumentCode
    598578
  • Title

    Design and modeling of a non-blocking checkpointing system

  • Author

    Sato, Kiminori ; Mohror, Kathryn ; Moody, Adam ; Gamblin, Todd ; de Supinski, Bronis R. ; Maruyama, Naoya ; Matsuoka, Shingo

  • Author_Institution
    Dept. of Math. & Comput. Sci., Tokyo Inst. of Technol., Tokyo, Japan
  • fYear
    2012
  • fDate
    10-16 Nov. 2012
  • Firstpage
    1
  • Lastpage
    10
  • Abstract
    As the capability and component count of systems increase, the MTBF decreases. Typically, applications tolerate failures with checkpoint/restart to a parallel file system (PFS). While simple, this approach can suffer from contention for PFS resources. Multi-level checkpointing is a promising solution. However, while multi-level checkpointing is successful on today's machines, it is not expected to be sufficient for exascale class machines, which are predicted to have orders of magnitude larger memory sizes and failure rates. Our solution combines the benefits of non-blocking and multi-level checkpointing. In this paper, we present the design of our system and model its performance. Our experiments show that our system can improve efficiency by 1.1 to 2.0x on future machines. Additionally, applications using our checkpointing system can achieve high efficiency even when using a PFS with lower bandwidth.
  • Keywords
    checkpointing; parallel databases; software fault tolerance; software performance evaluation; MTBF; PFS resources; efficiency improvement; exascale class machines; failure tolerance; multilevel checkpointing; nonblocking checkpointing system design; nonblocking checkpointing system modeling; parallel file system; performance modeling; restart; system capability; system component count; Checkpointing; Computational modeling; Libraries; Redundancy; Servers; Switches; Thyristors; Checkpoint/Restart; Fault tolerance; Markov model;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for
  • Conference_Location
    Salt Lake City, UT
  • ISSN
    2167-4329
  • Print_ISBN
    978-1-4673-0805-2
  • Type

    conf

  • DOI
    10.1109/SC.2012.46
  • Filename
    6468461