Title :
Performance optimization of checkpointing schemes with task duplication
Author :
Ziv, Avi ; Bruck, Jehoshua
Author_Institution :
IBM Israel Sci. & Technol. Center, Haifa, Israel
fDate :
12/1/1997
Abstract :
In checkpointing schemes with task duplication, checkpointing serves two purposes: detecting faults by comparing the processors' states at checkpoints, and reducing fault recovery time by supplying a safe point to roll back to. In this paper, we show that, by tuning the checkpointing schemes to a given architecture, a significant reduction in the execution time can be achieved. The main idea is to use two types of checkpoints: compare-checkpoints (comparing the states of the redundant processes to detect faults) and store-checkpoints (storing the states to reduce recovery time). With two types of checkpoints, we can use both the comparison and storage operations in an efficient way and improve the performance of checkpointing schemes. The results we obtained show that, in some cases, using compare and store checkpoints can reduce the overhead of DMR checkpointing schemes by as much as 30 percent.
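The split between compare-checkpoints and store-checkpoints can be illustrated with a toy simulation. This is a minimal sketch under assumptions that are not taken from the paper: the task is cut into equal intervals, the two redundant processes compare states after every interval, every `store_every`-th checkpoint also stores the state, a fault is transient and is detected by the comparison that ends the faulty interval, and the cost values `t_work`, `t_compare`, and `t_store` are hypothetical. The function name and all parameters are illustrative only.

```python
def execution_time(n_intervals, store_every, faulty,
                   t_work=1.0, t_compare=0.1, t_store=0.5):
    """Total time of a duplicated task under a toy checkpointing model.

    Assumptions (illustrative, not the paper's model):
    - states are compared after every interval (compare-checkpoint);
    - every `store_every`-th checkpoint also stores the state
      (store-checkpoint);
    - `faulty` holds interval indices whose first execution is hit by a
      transient fault; the mismatch is found at the next comparison and
      both processes roll back to the last stored state.
    """
    pending = set(faulty)   # faults that have not yet occurred
    elapsed = 0.0
    last_stored = 0         # intervals covered by the last stored state
    done = 0                # intervals completed so far
    while done < n_intervals:
        # Execute one interval on both processes, then compare states.
        elapsed += t_work + t_compare
        if done in pending:
            # Comparison mismatch: roll back to the last stored state.
            pending.discard(done)
            done = last_stored
            continue
        done += 1
        if done % store_every == 0:
            # Store-checkpoint: also save the (agreed) state.
            elapsed += t_store
            last_stored = done
    return elapsed


if __name__ == "__main__":
    for k in (1, 2, 4):
        print(k, execution_time(4, k, set()), execution_time(4, k, {1}))
```

With these assumed costs, storing at every checkpoint takes about 6.4 time units for a fault-free four-interval run, while storing at every second checkpoint takes about 5.4. The sparser storing schedule pays for this with a longer rollback when a fault does occur, which is the trade-off that tuning the two checkpoint types is meant to balance.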
Keywords :
fault tolerant computing; performance evaluation; DMR checkpointing schemes; checkpointing schemes; compare-checkpoints; fault recovery time; performance optimization; store-checkpoints; task duplication; Bandwidth; Checkpointing; Concurrent computing; Ethernet networks; Fault detection; Local area networks; Optimization; Parallel processing; Postal services; Workstations;
Journal_Title :
Computers, IEEE Transactions on