DocumentCode
3147671
Title
Dependable initialization of large-scale distributed software
Author
Ren, Yansong Jennifer ; Buskens, Rick ; Gonzalez, Oscar
Author_Institution
Bell Labs., Lucent Technol., Murray Hill, NJ, USA
fYear
2004
fDate
28 June-1 July 2004
Firstpage
335
Lastpage
344
Abstract
Most documented efforts in fault-tolerant computing address the problem of recovering from failures that occur during normal system operation. To bring a system to a point where it can begin performing its duties first requires that the system successfully complete initialization. Large-scale distributed systems may take hours to initialize. For such systems, a key challenge is tolerating failures that occur during initialization, while still completing initialization in a timely manner. In this paper, we present a dependable initialization model that captures the architecture of the system to be initialized, as well as interdependencies among system components. We show that overall system initialization may sometimes complete more quickly if recovery actions are deferred as opposed to commencing recovery actions as soon as a failure is detected. This observation leads us to introduce a recovery decision function that dynamically assesses when to take recovery actions. We then describe a dependable initialization algorithm that combines the dependable initialization model and the recovery decision function for achieving fast initialization. Experimental results show that our algorithm incurs lower initialization overhead than that of a conventional initialization algorithm. This work is the first effort we are aware of that formally studies the challenges of initializing a distributed system in the presence of failures.
Keywords
distributed processing; fault tolerant computing; large-scale systems; system recovery; distributed software; failure detection; failure recovery; initialization overhead; recovery decision function; Checkpointing; Communication channels; Databases; Fault tolerance; Fault tolerant systems; Grid computing
fLanguage
English
Publisher
IEEE
Conference_Titel
Dependable Systems and Networks, 2004 International Conference on
Print_ISBN
0-7695-2052-9
Type
conf
DOI
10.1109/DSN.2004.1311903
Filename
1311903