DocumentCode
228769
Title
Understanding the Effects of Communication and Coordination on Checkpointing at Scale
Author
Ferreira, Kurt B. ; Widener, Patrick ; Levy, Scott ; Arnold, Dorian ; Hoefler, Torsten
Author_Institution
Scalable Syst. Software, Sandia Nat. Labs., Albuquerque, NM, USA
fYear
2014
fDate
16-21 Nov. 2014
Firstpage
883
Lastpage
894
Abstract
Fault-tolerance poses a major challenge for future large-scale systems. Active research into coordinated, uncoordinated, and hybrid checkpointing systems has explored how the introduction of asynchrony can address anticipated scalability issues. However, few insights into selection and tuning of these protocols for applications at scale have emerged. In this paper, we use a simulation-based approach to show that local checkpoint activity in resilience mechanisms can significantly affect the performance of key workloads, even when less than 1% of a local node's compute time is allocated to resilience mechanisms (a very generous assumption). Specifically, we show that even though much work on uncoordinated checkpointing has focused on optimizing message log volumes, local checkpointing activity may dominate the overheads of this technique at scale. Our study shows that local checkpoints lead to process delays that can propagate through messaging relations to other processes causing a cascading series of delays. We demonstrate how to tune hierarchical uncoordinated checkpointing protocols designed to reduce log volumes to significantly reduce these synchronization overheads at scale. Our work provides a critical analysis and comparison of coordinated and uncoordinated checkpointing and enables users and system administrators to fine-tune the checkpointing scheme to the application and system characteristics.
Keywords
checkpointing; fault tolerant computing; synchronisation; anticipated scalability issues; communication effects; coordination effects; critical analysis; fault-tolerance; hierarchical uncoordinated checkpointing protocols; hybrid checkpointing systems; large-scale systems; local checkpoint activity; local node compute time; message log volume optimization; process delays; resilience mechanisms; simulation-based approach; synchronization overheads; system administrators; Checkpointing; Computational modeling; Delays; Mathematical model; Protocols; Resilience; Synchronization;
fLanguage
English
Publisher
IEEE
Conference_Titel
SC14: International Conference for High Performance Computing, Networking, Storage and Analysis
Conference_Location
New Orleans, LA
Print_ISBN
978-1-4799-5499-5
Type
conf
DOI
10.1109/SC.2014.77
Filename
7013059