Title :
Concurrent robust checkpointing and recovery in distributed systems
Author :
Leu, Pei-jyun ; Bhargava, Bharat
Author_Institution :
Dept. of Comput. Sci., Purdue Univ., West Lafayette, IN, USA
Abstract :
A checkpoint/rollback algorithm is presented for multiple processes in a distributed system that uses message passing for communication. Each process in the system can initiate the algorithm autonomously. If only one instance of the algorithm is being executed, the algorithm forces the minimal number of processes other than the initiator to make checkpoints (or roll back). The contributions of this research are as follows: (1) the algorithm can be executed concurrently for different global checkpointing instances and rollback instances initiated by several processes, and deadlocks or livelocks among these instances will not occur; (2) the algorithm is resilient to multiple process failures and handles network partitioning in a pessimistic way; and (3) the algorithm does not require that messages be received in the order in which they are sent.
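As a rough illustration of what "the minimal number of additional processes" can mean, the sketch below computes a dependency closure over messages exchanged since the last checkpoints. This is an assumption about the general idea, not the algorithm from the paper: the names `forced_set`, `received_from`, and `sent_to` are hypothetical, and concurrency between instances, failures, and partitioning are ignored entirely.

```python
from collections import deque

def forced_set(initiator, depends_on):
    """Return every process, other than the initiator, reachable from it
    by following depends_on edges (breadth-first transitive closure)."""
    seen = {initiator}
    queue = deque([initiator])
    while queue:
        p = queue.popleft()
        for q in depends_on.get(p, ()):
            if q not in seen:
                seen.add(q)
                queue.append(q)
    return seen - {initiator}

# Hypothetical data: received_from[p] is the set of processes whose
# messages p has received since those senders last checkpointed.
received_from = {"P1": {"P2"}, "P2": {"P3"}, "P3": set(), "P4": {"P1"}}

# If P1 checkpoints, the senders it depends on must checkpoint too, so
# that no recorded receive lacks a recorded send (assumed rule).
checkpoint_cohort = forced_set("P1", received_from)

# For rollback the dependency runs the other way: undoing a send forces
# the receiver to roll back as well (assumed rule).
sent_to = {}
for receiver, senders in received_from.items():
    for sender in senders:
        sent_to.setdefault(sender, set()).add(receiver)
rollback_cohort = forced_set("P1", sent_to)
```

With this sample data, a checkpoint started by P1 pulls in P2 and P3 but not P4, while a rollback started by P1 pulls in only P4; processes outside the closure are left undisturbed.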
Keywords :
distributed databases; concurrent robust checkpointing; distributed systems; message passing; multiple process failures; recovery; rollback algorithm; Checkpointing; Concurrent computing; Content addressable storage; Distributed computing; Interference; Merging; Message passing; NASA; Partitioning algorithms; Robustness
Conference_Title :
Data Engineering, 1988. Proceedings. Fourth International Conference on
Conference_Location :
Los Angeles, CA
Print_ISBN :
0-8186-0827-7
DOI :
10.1109/ICDE.1988.105457