Title :
Low-cost fault-tolerance in barrier synchronizations
Author :
Kulkarni, Sandeep S. ; Arora, Anish
Author_Institution :
Dept. of Comput. & Inf. Sci., Ohio State Univ., Columbus, OH, USA
Abstract :
We show how tolerance to several types of faults can be effectively added to program computations that use barrier synchronization. We divide the faults that occur in practice into two classes, detectable and undetectable, and design a fully distributed program that tolerates the faults in both classes. Our program guarantees that every barrier is executed correctly even if detectable faults occur, and that eventually every barrier is executed correctly even if undetectable faults occur. Using analytical as well as simulation results, we show that the cost of adding fault-tolerance is low, in part by comparing the times required by our program with those required by its fault-intolerant counterpart.
Keywords :
message passing; parallel algorithms; parallel programming; software fault tolerance; synchronisation; barrier synchronizations; detectable faults; distributed program; low cost fault tolerance; program computations; simulation; undetectable faults; Analytical models; Communication standards; Concurrent computing; Costs; Fault detection; Fault tolerance; Information science; Workstations
Conference_Title :
Proceedings of the 1998 International Conference on Parallel Processing
Conference_Location :
Minneapolis, MN
Print_ISBN :
0-8186-8650-2
DOI :
10.1109/ICPP.1998.708472