Title :
Checkpoint and Recovery for Parallel Applications with Dynamic Number of Processes
Author :
Thoai, Nam ; Hung, Doan Viet
Author_Institution :
Ho Chi Minh City Univ. of Technol., Ho Chi Minh City
Abstract :
This paper presents a checkpoint and recovery (C&R) protocol that provides fault tolerance for PVM (Parallel Virtual Machine). The protocol masks fail-stop failures from the application. The C&R activities are transparent and require no changes to the PVM library or the operating system. In PVM, an application can change its number of processes during execution. This paper focuses on the problems raised by the dynamic spawning and asynchronous exit of tasks in PVM. The proposed protocol is non-blocking, so it reduces the side effects of checkpoint activities on the original programs.
Keywords :
checkpointing; fault tolerance; parallel machines; parallel programming; protocols; virtual machines; checkpoint and recovery protocol; fail-stop failures; fault-tolerance; parallel applications; parallel virtual machines; Application software; Checkpointing; Computer applications; Computer science; Distributed computing; Fault tolerance; Libraries; Operating systems; Protocols; Virtual machining;
Conference_Title :
Parallel and Distributed Computing, 2007. ISPDC '07. Sixth International Symposium on
Conference_Location :
Hagenberg
DOI :
10.1109/ISPDC.2007.10