Title :
Reducing message logging overhead for log-based recovery
Author_Institution :
Coordinated Sci. Lab., Illinois Univ., Urbana, IL, USA
Abstract :
Checkpointing and rollback recovery is essential for long-running parallel applications. In the case of a transient fault or system crash, the affected application programs can recover from a consistent set of checkpoints saved earlier instead of restarting from the very beginning. For applications requiring transparent fault tolerance, log-based recovery can usually achieve a better recoverable state at the cost of message logging in addition to checkpointing. A simple scheme for reducing message logging overhead based on local dependency information is presented. Communication trace-driven simulation for several parallel applications is used to evaluate the benefits of the proposed scheme for real applications.
Keywords :
fault tolerant computing; message passing; parallel processing; system recovery; checkpointing; communication trace-driven simulation; local dependency information; log-based recovery; long-running parallel applications; message logging overhead; recoverable state; rollback recovery; system crash; transient fault; transparent fault tolerance; Application software; Checkpointing; Computer crashes; Concurrent computing; Costs; Hardware; Protocols
Conference_Title :
Circuits and Systems, 1993., ISCAS '93, 1993 IEEE International Symposium on
Conference_Location :
Chicago, IL
Print_ISBN :
0-7803-1281-3
DOI :
10.1109/ISCAS.1993.394126