Title :
Design and analysis of a hardware-assisted checkpointing and recovery scheme for distributed applications
Author :
Ramamurthy, Bina ; Upadhyaya, Shambhu ; Bhargava, Bharat
Author_Institution :
Dept. of Comput. Sci. & Eng., State Univ. of New York, Buffalo, NY, USA
Abstract :
A checkpointing and recovery scheme that exploits the low latency and high coverage of a hardware error detection scheme is presented. Message dependency, which is the main source of multi-step rollback in distributed systems, is minimized by using a new message validation technique derived from hardware-assisted error detection. The main contribution of this paper is the development of an analytical model to establish the completeness and correctness of the new scheme. A novel concept, the global state matrix, is defined to keep track of the global state in a distributed system and to assist in recovery. An illustration is given to show the distinction between the conventional and the new recovery schemes.
Keywords :
fault tolerant computing; message passing; system recovery; distributed applications; distributed system; global state matrix; hardware error detection scheme; hardware-assisted checkpointing; hardware-assisted error detection; high coverage; low latency; message dependency; message validation technique; multi-step rollback; recovery scheme; Analytical models; Application software; Checkpointing; Computer errors; Computer science; Delay; Error correction; Hardware; Real time systems; Redundancy;
Conference_Title :
Reliable Distributed Systems, 1998. Proceedings. Seventeenth IEEE Symposium on
Conference_Location :
West Lafayette, IN
Print_ISBN :
0-8186-9218-9
DOI :
10.1109/RELDIS.1998.740478