DocumentCode :
1824245
Title :
Checkpointing and recovery for distributed shared memory applications
Author :
Ouyang, Jinsong ; Heiser, Gernot
Author_Institution :
Sch. of Comput. Sci. & Eng., New South Wales Univ., Kensington, NSW, Australia
fYear :
1995
fDate :
14-15 Aug 1995
Firstpage :
191
Lastpage :
199
Abstract :
The paper proposes an approach for adding fault tolerance, based on consistent checkpointing, to distributed shared memory applications. Two different mechanisms are presented to efficiently address the issue of message losses due to either site failures or unreliable non-FIFO channels. Both guarantee a correct and efficient recovery from a consistent distributed system state following a failure. A variant of the two-phase commit protocol is employed such that the communication overhead required to take a consistent checkpoint is the same as that of systems using a one-phase commit protocol, while our protocol utilises stable storage more efficiently. A consistent checkpoint is committed when the first phase of the protocol finishes.
Keywords :
fault tolerant computing; protocols; shared memory systems; system recovery; communication overhead; consistent checkpointing; consistent distributed system state; distributed shared memory applications; failure; fault tolerance; message losses; one-phase commit protocol; recovery; site failures; stable storage; two-phase commit protocol; unreliable non-FIFO channels; Application software; Australia; Checkpointing; Computer science; Distributed computing; Fault tolerance; Fault tolerant systems; Protocols;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Fourth International Workshop on Object-Orientation in Operating Systems, 1995
Conference_Location :
Lund
Print_ISBN :
0-8186-7115-7
Type :
conf
DOI :
10.1109/IWOOS.1995.470555
Filename :
470555