Title :
User-triggered checkpointing: system-independent and scalable application recovery
Author :
Deconinck, Geert ; Lauwereins, Rudy
Author_Institution :
Electrotech. Dept., Katholieke Univ., Leuven, Heverlee, Belgium
Abstract :
User-triggered checkpointing and rollback is proposed as a system-independent and flexible way to integrate backward error recovery in long-running, computation-intensive message-passing applications on large parallel multicomputers. It employs library calls to coordinate the checkpointing, allowing a non-blocking and scalable approach that requires no protocol to save a consistent state because the coordination among the processes is implicit. The explicit indication of the checkpoint contents (i.e., the items of which the state must be saved) allows one to significantly reduce the amount of checkpoint data and the overhead. In contrast to other checkpointing approaches, the implementation does not rely on system-dependent features (like saving register values or communication status) to save the state. Instead, re-executing the first part of the application brings the system-specific items into a state consistent with the rest of the checkpoint contents, which are restored from the saved checkpoint data.
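As an illustration only, the idea in the abstract can be sketched for a single process in Python. This is not the authors' library: the class `UserCheckpoint`, the function `run`, the parameter `fail_at`, and the file layout are all hypothetical names chosen for the example, and the sketch does not model message passing or the implicit coordination among processes. It shows just two points from the abstract: the application names the items to save at a location it chooses, and after a rollback the first part of the program is re-executed to rebuild everything that was not saved.

```python
import os
import pickle
import tempfile


class UserCheckpoint:
    """Hypothetical helper: saves only the items the application names."""

    def __init__(self, path):
        self.path = path

    def save(self, **items):
        # Write to a temporary file first, then rename, so a crash during
        # the write never leaves a truncated checkpoint behind.
        tmp = self.path + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(items, f)
        os.replace(tmp, self.path)

    def load(self):
        # Returns the saved items, or None when no checkpoint exists yet.
        if not os.path.exists(self.path):
            return None
        with open(self.path, "rb") as f:
            return pickle.load(f)


def run(ckpt, n_steps, fail_at=None):
    # "First part" of the application: always re-executed, also after a
    # rollback, to rebuild whatever is not in the checkpoint (here a
    # stand-in for buffers, handles, or communication setup).
    scratch = [0] * 4

    # Restore the explicitly saved items, or start from the beginning.
    saved = ckpt.load()
    step, total = (saved["step"], saved["total"]) if saved else (0, 0)

    while step < n_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated failure")
        scratch[step % 4] = step      # transient data, never saved
        total += step
        step += 1
        if step % 3 == 0:             # user-chosen checkpoint location
            ckpt.save(step=step, total=total)
    return total


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "app.ckpt")
    ckpt = UserCheckpoint(path)
    try:
        run(ckpt, 10, fail_at=7)      # first attempt fails part-way
    except RuntimeError:
        pass
    print(run(ckpt, 10))              # second attempt resumes and finishes
```

Only `step` and `total` reach the checkpoint file; the `scratch` buffer is rebuilt by re-running the start of `run`, which is the sketch's stand-in for the system-specific items the abstract says need not be saved.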
Keywords :
fault tolerant computing; message passing; parallel machines; system recovery; backward error recovery; checkpoint contents; checkpoint data; message-passing applications; nonblocking approach; overhead reduction; parallel multicomputers; rollback; scalable application recovery; scalable approach; system-independent recovery; user-triggered checkpointing; Application software; Checkpointing; Computational modeling; Computer applications; Concurrent computing; High performance computing; Libraries; Power system modeling; Power system restoration; Protocols
Conference_Title :
Computers and Communications, 1997. Proceedings., Second IEEE Symposium on
Conference_Location :
Alexandria
Print_ISBN :
0-8186-7852-6
DOI :
10.1109/ISCC.1997.616035