Title :
Fault tolerant distributed computing using atomic send-receive checkpoints
Author :
Wójcik, Zbigniew M. ; Wójcik, Barbara E.
Author_Institution :
Div. of Math., Comput. Sci. & Stat., Texas Univ., San Antonio, TX, USA
Abstract :
The paper presents a deadlock-free fault recovery algorithm for a fully distributed system in which messages need not arrive in the order in which they were sent. The method is based on asynchronous, atomic checkpointing of the sender and the receiver of a message. Messages not balanced in the last permanent checkpoints are recorded in the new checkpoints. Fault recovery is based on: (a) repeating all lost messages according to a record of unbalanced messages in the last permanent checkpoints, and (b) undoing every message re-sent during the fault recovery, or undoing a computation repeated according to a record of unbalanced messages in the last permanent checkpoints. A fault recovery involves only the processes that communicated before the failure. A distributed computation may be split into a few segments without affecting transaction consistency. The algorithm involves the minimum number of messages. A proof of the resilience of the fault recovery algorithm is presented.
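As a rough illustration of the "unbalanced messages" idea in the abstract, the following Python sketch treats a message as unbalanced when it appears as sent in one process's last permanent checkpoint but not as received in any checkpoint. It is not the authors' algorithm: the names `Message`, `Checkpoint`, `unbalanced_messages` and `deliver`, the per-sender sequence numbers, and the duplicate-discarding rule used to stand in for "undoing" a re-sent message are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    sender: str
    receiver: str
    seq: int  # per-sender sequence number (an assumption of this sketch)


@dataclass
class Checkpoint:
    """Hypothetical last permanent checkpoint of one process."""
    process: str
    sent: set = field(default_factory=set)      # messages recorded as sent
    received: set = field(default_factory=set)  # messages recorded as received


def unbalanced_messages(checkpoints):
    """Messages recorded as sent but not recorded as received anywhere."""
    all_sent = set()
    all_received = set()
    for cp in checkpoints:
        all_sent |= cp.sent
        all_received |= cp.received
    # Sort for a deterministic re-send order.
    return sorted(all_sent - all_received, key=lambda m: (m.sender, m.seq))


def deliver(processed, msg):
    """Apply a (possibly re-sent) message once.

    Returns True if the message was applied, False if it was a duplicate
    and therefore discarded (this sketch's stand-in for undoing a re-send).
    """
    if msg in processed:
        return False
    processed.add(msg)
    return True


# Example: p sent two messages, but q's checkpoint only records the first.
m1 = Message("p", "q", 1)
m2 = Message("p", "q", 2)
cp_p = Checkpoint("p", sent={m1, m2})
cp_q = Checkpoint("q", received={m1})

to_resend = unbalanced_messages([cp_p, cp_q])

# q restarts from its checkpoint, so it has already processed m1.
processed_by_q = set(cp_q.received)
results = [deliver(processed_by_q, m) for m in [m1] + to_resend]
```

In this example only `m2` is re-sent, and a stray repeat of `m1` is discarded at the receiver; only processes `p` and `q`, which communicated before the failure, take part.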
Keywords :
distributed processing; fault tolerant computing; system recovery; transaction processing; asynchronous messages; atomic checkpointing; atomic send-receive checkpoints; checkpoint consistency; deadlock free fault recovery algorithm; last permanent checkpoints; transaction consistency; unbalanced messages; Checkpointing; Computer science; Distributed computing; Error correction; Fault detection; Fault tolerance; Mathematics; Resilience; Statistical distributions; System recovery
Conference_Title :
Parallel and Distributed Processing, 1990. Proceedings of the Second IEEE Symposium on
Conference_Location :
Dallas, TX
Print_ISBN :
0-8186-2087-0
DOI :
10.1109/SPDP.1990.143536