Title :
DejaVu: Transparent User-Level Checkpointing, Migration, and Recovery for Distributed Systems
Author :
Ruscio, Joseph F. ; Heffner, Michael A. ; Varadarajan, Srinidhi
Author_Institution :
Dept. of Comput. Sci., Virginia Tech., VA
Abstract :
In this paper, we present a new fault tolerance system called DejaVu for transparent and automatic checkpointing, migration, and recovery of parallel and distributed applications. DejaVu provides a transparent parallel checkpointing and recovery mechanism that recovers from any combination of system failures without any modification to parallel applications or the OS. It uses a new runtime mechanism for transparent incremental checkpointing that captures the least amount of state needed to maintain global consistency, and it provides a novel communication architecture that enables transparent migration of existing MPI codes without source-code modifications. Performance results from the production-ready implementation show less than 5% overhead in real-world parallel applications with large memory footprints.
Keywords :
application program interfaces; checkpointing; fault tolerant computing; message passing; parallel processing; system monitoring; DejaVu fault tolerance system; MPI code; communication architecture; distributed system automatic migration; distributed system automatic recovery; runtime mechanism; system failure; transparent incremental checkpointing; transparent parallel user-level checkpointing; Application software; Checkpointing; Computer networks; Computer science; Concurrent computing; Distributed computing; Fault tolerant systems; Laboratories; Runtime; Stability
Conference_Title :
Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International
Conference_Location :
Long Beach, CA
Print_ISBN :
1-4244-0910-1
Electronic_ISBN :
1-4244-0910-1
DOI :
10.1109/IPDPS.2007.370309