• DocumentCode
    228769
  • Title

    Understanding the Effects of Communication and Coordination on Checkpointing at Scale

  • Author

    Ferreira, Kurt B. ; Widener, Patrick ; Levy, Scott ; Arnold, Dorian ; Hoefler, Torsten

  • Author_Institution
    Scalable Syst. Software, Sandia Nat. Labs., Albuquerque, NM, USA
  • fYear
    2014
  • fDate
    16-21 Nov. 2014
  • Firstpage
    883
  • Lastpage
    894
  • Abstract
    Fault-tolerance poses a major challenge for future large-scale systems. Active research into coordinated, uncoordinated, and hybrid checkpointing systems has explored how the introduction of asynchrony can address anticipated scalability issues. However, few insights into selection and tuning of these protocols for applications at scale have emerged. In this paper, we use a simulation-based approach to show that local checkpoint activity in resilience mechanisms can significantly affect the performance of key workloads, even when less than 1% of a local node's compute time is allocated to resilience mechanisms (a very generous assumption). Specifically, we show that even though much work on uncoordinated checkpointing has focused on optimizing message log volumes, local checkpointing activity may dominate the overheads of this technique at scale. Our study shows that local checkpoints lead to process delays that can propagate through messaging relations to other processes, causing a cascading series of delays. We demonstrate how to tune hierarchical uncoordinated checkpointing protocols designed to reduce log volumes to significantly reduce these synchronization overheads at scale. Our work provides a critical analysis and comparison of coordinated and uncoordinated checkpointing and enables users and system administrators to fine-tune the checkpointing scheme to the application and system characteristics.
  • Keywords
    checkpointing; fault tolerant computing; synchronisation; anticipated scalability issues; communication effects; coordination effects; critical analysis; fault-tolerance; hierarchical uncoordinated checkpointing protocols; hybrid checkpointing systems; large-scale systems; local checkpoint activity; local node compute time; message log volume optimization; process delays; resilience mechanisms; simulation-based approach; synchronization overheads; system administrators; Checkpointing; Computational modeling; Delays; Mathematical model; Protocols; Resilience; Synchronization;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    High Performance Computing, Networking, Storage and Analysis, SC14: International Conference for
  • Conference_Location
    New Orleans, LA
  • Print_ISBN
    978-1-4799-5499-5
  • Type

    conf

  • DOI
    10.1109/SC.2014.77
  • Filename
    7013059