Title :
Checkpoint and rollback in asynchronous distributed systems
Author :
Higaki, Hiroaki ; Shima, Kenji ; Tachikawa, Takayuki ; Takizawa, Makoto
Author_Institution :
Dept. of Comput. & Syst. Eng., Tokyo Denki Univ., Japan
Abstract :
This paper proposes a novel algorithm for taking checkpoints and rolling back processes for recovery in asynchronous distributed systems. The algorithm has the following properties: (1) multiple processes can initiate checkpointing simultaneously; (2) no additional messages are transmitted for taking checkpoints; (3) the set of local checkpoints taken by multiple processes denotes a consistent global state; (4) multiple processes can initiate rollback recovery simultaneously; (5) the minimum number of processes is rolled back; and (6) each process is rolled back asynchronously. The number of messages for rolling back the processes is O(l), where l is the number of channels. Therefore, the algorithm keeps the system highly available.
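Property (3) relies on the standard notion of a consistent global state: no message may be recorded as received before the receiver's checkpoint unless its send is also recorded before the sender's checkpoint. The Python sketch below illustrates that textbook condition only. It is not the algorithm proposed in the paper, and the names `Message`, `is_consistent`, and the integer event indices are assumptions introduced for illustration.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    # Hypothetical record of one message; times are local event indices.
    sender: str
    send_time: int    # index of the send event at the sender
    receiver: str
    recv_time: int    # index of the receive event at the receiver


def is_consistent(checkpoints: dict[str, int], messages: list[Message]) -> bool:
    """Return True if the local checkpoints contain no orphan message.

    checkpoints maps each process to the index of its last event
    included in its local checkpoint (an assumed representation).
    """
    for m in messages:
        sent_before = m.send_time <= checkpoints[m.sender]
        received_before = m.recv_time <= checkpoints[m.receiver]
        # Orphan: the receipt is recorded but the send is not.
        if received_before and not sent_before:
            return False
    return True


# p1 sends at its event 2; p2 receives at its event 3.
messages = [Message("p1", 2, "p2", 3)]
consistent_cut = {"p1": 2, "p2": 3}     # send and receipt both recorded
in_transit_cut = {"p1": 2, "p2": 1}     # send recorded, receipt not yet
inconsistent_cut = {"p1": 1, "p2": 3}   # receipt recorded without its send
```

Under these assumptions, `is_consistent` accepts the first two cuts and rejects the third. A message that is still in transit does not violate consistency, whereas an orphan message does.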
Keywords :
computer network reliability; distributed processing; algorithm; asynchronous distributed systems; channels; checkpoint; consistent global state; information systems; multiple processes; rollback recovery; Application software; Availability; Checkpointing; Distributed computing; Fault tolerant systems; Hardware; Information systems; Internet; Protocols; Systems engineering and theory
Conference_Title :
INFOCOM '97. Sixteenth Annual Joint Conference of the IEEE Computer and Communications Societies. Driving the Information Revolution., Proceedings IEEE
Conference_Location :
Kobe
Print_ISBN :
0-8186-7780-5
DOI :
10.1109/INFCOM.1997.631114