DocumentCode :
3026143
Title :
The effect of different failure recovery procedures on the distribution of task completion times
Author :
Sheahan, Robert ; Lipsky, Lester ; Fiorini, Pierre
Author_Institution :
Dept. of Comput. Sci. & Eng., Connecticut Univ., Storrs, CT, USA
fYear :
2005
fDate :
4-8 April 2005
Abstract :
For a system to be reliable, it must have one or more methods of dealing with failures. Distributed systems face both node failure and communication channel failure. Communication channels, in particular, may suffer failures at a very high rate. Different systems respond to task failure in different ways. The system may resume a failed task from the failure point (or from a checkpoint saved shortly before the failure point), it may restart the task from the beginning, or it may give up on the task and select a replacement task from the ready queue. All three responses to failure change the distribution of task completion times. This distribution is important because it governs mean service time and queue length, and therefore quality of service and the buffer size needed to manage the risk of overflow. The changes to the distribution introduced by the failure response can even turn well-behaved exponentially distributed times into powertail distributed times with infinite mean and variance. In this paper we examine the characteristics of the distributions that result from restarting after each interrupt, with some discussion of resume and replace for comparison. We provide analytic and simulation solutions.
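The contrast between the restart and resume policies described in the abstract can be illustrated with a small Monte Carlo sketch. This is not the authors' simulation; it assumes failures arrive as a Poisson process with rate fail_rate, that repair takes a fixed time (zero by default), and the function names restart_time, resume_time, and restart_mean_fixed_work are hypothetical. For a fixed amount of work T, the standard closed form for the mean restart completion time under these assumptions is (exp(fail_rate * T) - 1) / fail_rate, which the sketch also exposes for comparison.

```python
import math
import random


def restart_time(work, fail_rate, rng):
    """Time to finish `work` units when every failure discards all progress.

    Assumption: times between failures are exponential with rate `fail_rate`
    and repair is instantaneous.
    """
    total = 0.0
    while True:
        time_to_failure = rng.expovariate(fail_rate)
        if time_to_failure >= work:
            # The task ran to completion before the next failure.
            return total + work
        # Failure struck first: the elapsed time is lost and the task restarts.
        total += time_to_failure


def resume_time(work, fail_rate, rng, repair=0.0):
    """Time to finish `work` units when progress survives each failure.

    Assumption: each failure costs a fixed `repair` delay, then work continues
    from the failure point.
    """
    total = 0.0
    remaining = work
    while True:
        time_to_failure = rng.expovariate(fail_rate)
        if time_to_failure >= remaining:
            return total + remaining
        total += time_to_failure + repair
        remaining -= time_to_failure


def restart_mean_fixed_work(work, fail_rate):
    """Closed-form mean of restart_time for a fixed amount of work."""
    return (math.exp(fail_rate * work) - 1.0) / fail_rate
```

Averaging restart_time(1.0, 1.0, rng) over many samples should approach restart_mean_fixed_work(1.0, 1.0), about 1.72, while resume_time with zero repair always returns the original work. Drawing work itself from an exponential distribution before calling restart_time gives a heavy-tailed sample of the kind the abstract refers to, since the restart cost grows exponentially in the task length.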
Keywords :
buffer storage; checkpointing; failure analysis; quality of service; queueing theory; statistical distributions; buffer size; communication channel failure; distributed system; exponentially distributed times; failure recovery procedures; mean service time; powertail distributed times; quality of service; task completion time distribution; Analytical models; Checkpointing; Communication channels; Computer science; Failure analysis; Quality management; Quality of service; Reliability engineering; Resumes; Risk management;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International
Print_ISBN :
0-7695-2312-9
Type :
conf
DOI :
10.1109/IPDPS.2005.426
Filename :
1420245