Title :
Faster checkpointing with N+1 parity
Author :
Plank, J.S. ; Kai Li
Author_Institution :
Dept. of Comput. Sci., Tennessee Univ., Knoxville, TN, USA
Abstract :
This paper presents a way to perform fast incremental checkpointing of multicomputers and distributed systems by using N+1 parity. A basic algorithm is described that uses two extra processors for checkpointing and enables the system to tolerate any single processor failure. The algorithm's speed comes from a combination of N+1 parity, extra physical memory, and virtual memory hardware so that checkpoints need not be written to disk. This eliminates the most time-consuming portion of checkpointing. The algorithm requires each application processor to allocate a fixed amount of extra memory for checkpointing. This amount may be set statically by the programmer, and need not be equal to the size of the processor's writable address space. This alleviates a major restriction of previous checkpointing algorithms using N+1 parity. Finally, we outline how to extend our algorithm to tolerate any m processor failures with the addition of 2m extra checkpointing processors.
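The N+1 parity idea the abstract relies on can be illustrated with a minimal sketch. It assumes that parity means the bytewise XOR of equal-sized in-memory checkpoints, and it shows only how a single lost checkpoint is rebuilt from the survivors plus the parity block. It does not model the paper's incremental checkpointing, its use of virtual memory hardware, or the role of the second extra processor. The names `xor_blocks`, `checkpoints`, and `parity` are hypothetical and do not come from the paper.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of a list of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical checkpoints held in memory by N = 3 application processors.
checkpoints = [
    b"\x01\x02\x03\x04",
    b"\x10\x20\x30\x40",
    b"\xaa\xbb\xcc\xdd",
]

# Assumption: the parity processor stores the XOR of all N checkpoints.
parity = xor_blocks(checkpoints)

# Suppose processor 1 fails. XORing the surviving checkpoints with the
# parity block cancels every surviving term and leaves the lost checkpoint.
failed = 1
survivors = [c for i, c in enumerate(checkpoints) if i != failed]
recovered = xor_blocks(survivors + [parity])
```

Because XOR is its own inverse, any one of the N+1 blocks (N checkpoints plus parity) can be reconstructed from the other N, which is why a single processor failure is tolerable without writing checkpoints to disk.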
Keywords :
distributed processing; fault tolerant computing; reliability; virtual storage; N+1 parity; checkpointing; distributed systems; multicomputers; single processor failure; virtual memory hardware; Checkpointing; Computer science; Debugging; Fault tolerance; Hardware; Magnetic heads; Nonvolatile memory; Programming profession; Read-write memory; Writing;
Conference_Title :
Fault-Tolerant Computing, 1994. FTCS-24. Digest of Papers., Twenty-Fourth International Symposium on
Conference_Location :
Austin, TX, USA
Print_ISBN :
0-8186-5520-8
DOI :
10.1109/FTCS.1994.315631