Title :
A crash recovery technique in distributed computing systems
Author :
Young, Cheng-Ru ; Chiu, Ge-Ming
Author_Institution :
Dept. of Electr. Eng. & Technol., Nat. Taiwan Inst. of Technol., Taipei, Taiwan
Abstract :
In this paper we propose a new mechanism for implementing checkpoint/rollback-recovery in a distributed computing system. A logical-ring structure is introduced for the maintenance of recovery-related information. The message processing order of a process is maintained by all other processes on its associated ring. The scheme requires no time-consuming writes of order information to stable storage. As a result, fail-free overhead is small. When failures occur, only failed processes have to roll back to their latest checkpoints. Surviving processes continue execution without being blocked. Output commit is fast, as it needs no synchronization before a message is sent to the outside world.
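The idea summarized in the abstract can be illustrated with a toy, single-threaded simulation. This is only a sketch under stated assumptions, not the authors' protocol: the abstract does not give the ring construction, message format, or recovery steps, so the `Ring` and `Process` classes, the `sent` table (standing in for sender-side retransmission), and the order-sensitive `apply` update are all hypothetical names chosen for illustration.

```python
class Process:
    def __init__(self, pid):
        self.pid = pid
        self.state = 0        # toy application state
        self.checkpoint = 0   # state saved at the latest checkpoint
        self.order_logs = {}  # volatile: peer pid -> message ids in processing order

    def apply(self, payload):
        # Order-sensitive update, so the replay order matters.
        self.state = self.state * 10 + payload


class Ring:
    """Hypothetical logical ring: every member keeps order logs for the others."""

    def __init__(self, n):
        self.procs = [Process(i) for i in range(n)]
        self.sent = {}  # msg id -> payload; assumed stand-in for retransmission

    def peers(self, pid):
        return [p for p in self.procs if p.pid != pid]

    def deliver(self, pid, msg_id, payload):
        self.sent[msg_id] = payload
        # Peers record the processing order in volatile memory only;
        # nothing is written to stable storage on this path.
        for peer in self.peers(pid):
            peer.order_logs.setdefault(pid, []).append(msg_id)
        self.procs[pid].apply(payload)

    def take_checkpoint(self, pid):
        proc = self.procs[pid]
        proc.checkpoint = proc.state
        # Order information older than the checkpoint is no longer needed.
        for peer in self.peers(pid):
            peer.order_logs[pid] = []

    def crash_and_recover(self, pid):
        proc = self.procs[pid]
        proc.state = proc.checkpoint  # only the failed process rolls back
        proc.order_logs = {}          # its own volatile logs are lost
        # Fetch the processing order from a surviving ring member and replay.
        order = self.peers(pid)[0].order_logs.get(pid, [])
        for msg_id in order:
            proc.apply(self.sent[msg_id])
        return proc.state
```

In this sketch, a process that crashes after processing messages `m2` and `m3` (in that order) since its last checkpoint rebuilds the same state by replaying them in the order held by a surviving peer, while the other processes keep their state untouched.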
Keywords :
distributed processing; fault tolerant computing; message passing; software reliability; system recovery; checkpoint/rollback-recovery; crash recovery technique; distributed computing systems; fail-free overhead; logical-ring structure; message processing; recovery-related information; Binary search trees; Checkpointing; Computer crashes; Costs; Delay; Distributed computing; Fault tolerant systems; Optimization methods; Resumes; Writing;
Conference_Title :
Distributed Computing Systems, 1994., Proceedings of the 14th International Conference on
Conference_Location :
Poznan
Print_ISBN :
0-8186-5840-1
DOI :
10.1109/ICDCS.1994.302417