DocumentCode
3697010
Title
Marriage Between Coordinated and Uncoordinated Checkpointing for the Exascale Era
Author
Omer Subasi;Ferad Zyulkyarov;Osman Unsal;Jesus Labarta
Author_Institution
Barcelona Supercomput. Center, Univ. Politec. de Catalunya, Barcelona, Spain
fYear
2015
Firstpage
470
Lastpage
478
Abstract
The state-of-the-art checkpointing techniques are projected to be prohibitively expensive in the Exascale era. These techniques are most often holistic in nature, which prevents them from leveraging the programming-model- and paradigm-specific advantages that would make them viable at Exascale. In this work, we present a unified non-hierarchical model that combines uncoordinated checkpointing with coordinated system-wide checkpointing to capitalize on programming-model-specific advantages. In our analytical assessment, we develop closed-form formulas for the performance improvement and the optimal checkpoint interval of the unified model. As an instantiation of our model, we propose to unify task-level checkpointing with a system-wide checkpointing scheme for task-parallel HPC applications. This instantiation has three distinct advantages: first, it reduces performance overheads by decreasing the frequency of checkpoints in the unified system; second, it features fast failure recovery by using in-memory task-local checkpoints instead of on-disk global checkpoints; and third, it does not compromise the high failure coverage typical of system-wide checkpointing.
Keywords
"Checkpointing","Parallel processing","Mathematical model","Performance gain","Fault tolerance","Fault tolerant systems"
Publisher
ieee
Conference_Titel
High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), 2015 IEEE 17th International Conference on
Type
conf
DOI
10.1109/HPCC-CSS-ICESS.2015.150
Filename
7336204
Link To Document