DocumentCode :
2978313
Title :
A Synchronization-Induced Checkpoint Protocol for Group-Synchronous Parallel Programs
Author :
Zunce Wei ; Goswami, Debkalpa
Author_Institution :
Dept. of Comput. Sci. & Software Eng., Concordia Univ., Montreal, QC, Canada
fYear :
2012
fDate :
14-16 Dec. 2012
Firstpage :
632
Lastpage :
637
Abstract :
Group checkpointing is a middle ground between global checkpointing and log-based recovery. It offers both reduced runtime overhead and localized recovery, which improves the fault-tolerance performance of large-scale distributed systems. However, parallel programs cannot benefit efficiently from this strategy, because they often involve synchronous or semi-synchronous interactions that incur extra idling delays between processes as well as between process groups. This paper presents an analytical study of such delays and the corresponding delay optimization strategies. Observing that certain parallel programs exhibit patterns of "synchronization groups", we develop a Synchronization-Induced Checkpoint protocol that manages checkpoints around such groups. The protocol keeps the advantages of ordinary group checkpointing while minimizing the cost of synchronization-induced delays. We also broadly categorize the known synchronization patterns and relate them to suitable checkpoint strategies for parallel programs.
Keywords :
checkpointing; optimisation; parallel programming; software fault tolerance; synchronisation; delay optimization strategies; fault-tolerance performance; group checkpointing; group-synchronous parallel programs; large-scale distributed systems; localized recovery effect; log-based recovery; reduced runtime overhead; synchronization groups; synchronization-induced checkpoint protocol; synchronization-induced delays; Checkpointing; Delays; Fault tolerance; Pipelines; Protocols; Runtime; Synchronization; Fault tolerance; checkpoint and recovery; delay dependency; group checkpoint; synchronization group; synchronization-induced checkpoint;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Parallel and Distributed Computing, Applications and Technologies (PDCAT), 2012 13th International Conference on
Conference_Location :
Beijing
Print_ISBN :
978-0-7695-4879-1
Type :
conf
DOI :
10.1109/PDCAT.2012.31
Filename :
6589351