• DocumentCode
    3147671
  • Title
    Dependable initialization of large-scale distributed software
  • Author
    Ren, Yansong Jennifer ; Buskens, Rick ; Gonzalez, Oscar
  • Author_Institution
    Bell Labs., Lucent Technol., Murray Hill, NJ, USA
  • fYear
    2004
  • fDate
    28 June-1 July 2004
  • Firstpage
    335
  • Lastpage
    344
  • Abstract
    Most documented efforts in fault-tolerant computing address the problem of recovering from failures that occur during normal system operation. To bring a system to a point where it can begin performing its duties first requires that the system successfully complete initialization. Large-scale distributed systems may take hours to initialize. For such systems, a key challenge is tolerating failures that occur during initialization, while still completing initialization in a timely manner. In this paper, we present a dependable initialization model that captures the architecture of the system to be initialized, as well as interdependencies among system components. We show that overall system initialization may sometimes complete more quickly if recovery actions are deferred as opposed to commencing recovery actions as soon as a failure is detected. This observation leads us to introduce a recovery decision function that dynamically assesses when to take recovery actions. We then describe a dependable initialization algorithm that combines the dependable initialization model and the recovery decision function for achieving fast initialization. Experimental results show that our algorithm incurs lower initialization overhead than that of a conventional initialization algorithm. This work is the first effort we are aware of that formally studies the challenges of initializing a distributed system in the presence of failures.
  • Keywords
    distributed processing; fault tolerant computing; large-scale systems; system recovery; distributed software; failure detection; failure recovery; initialization overhead; recovery decision function; Checkpointing; Communication channels; Databases; Fault tolerance; Fault tolerant systems; Grid computing
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Dependable Systems and Networks, 2004 International Conference on
  • Print_ISBN
    0-7695-2052-9
  • Type
    conf
  • DOI
    10.1109/DSN.2004.1311903
  • Filename
    1311903