Understanding the Effects of Communication and Coordination on Checkpointing at Scale

Author

Ferreira, Kurt B. ; Widener, Patrick ; Levy, Scott ; Arnold, Dorian ; Hoefler, Torsten

Author_Institution

Scalable Syst. Software, Sandia Nat. Labs., Albuquerque, NM, USA

fYear

2014

fDate

16-21 Nov. 2014

Firstpage

883

Lastpage

894

Abstract

Fault-tolerance poses a major challenge for future large-scale systems. Active research into coordinated, uncoordinated, and hybrid checkpointing systems has explored how the introduction of asynchrony can address anticipated scalability issues. However, few insights into selection and tuning of these protocols for applications at scale have emerged. In this paper, we use a simulation-based approach to show that local checkpoint activity in resilience mechanisms can significantly affect the performance of key workloads, even when less than 1% of a local node's compute time is allocated to resilience mechanisms (a very generous assumption). Specifically, we show that even though much work on uncoordinated checkpointing has focused on optimizing message log volumes, local checkpointing activity may dominate the overheads of this technique at scale. Our study shows that local checkpoints lead to process delays that can propagate through messaging relations to other processes causing a cascading series of delays. We demonstrate how to tune hierarchical uncoordinated checkpointing protocols designed to reduce log volumes to significantly reduce these synchronization overheads at scale. Our work provides a critical analysis and comparison of coordinated and uncoordinated checkpointing and enables users and system administrators to fine-tune the checkpointing scheme to the application and system characteristics.

Keywords

checkpointing; fault tolerant computing; synchronisation; anticipated scalability issues; communication effects; coordination effects; critical analysis; fault-tolerance; hierarchical uncoordinated checkpointing protocols; hybrid checkpointing systems; large-scale systems; local checkpoint activity; local node compute time; message log volume optimization; process delays; resilience mechanisms; simulation-based approach; synchronization overheads; system administrators; Checkpointing; Computational modeling; Delays; Mathematical model; Protocols; Resilience; Synchronization;

fLanguage

English

Publisher

IEEE

Conference_Title

High Performance Computing, Networking, Storage and Analysis, SC14: International Conference for

Conference_Location

New Orleans, LA

Print_ISBN

978-1-4799-5499-5

Type

conf

DOI

10.1109/SC.2014.77

Filename

7013059