Title :
Theft-induced checkpointing for reconfigurable dataflow applications
Author :
Jafar, Syed ; Krings, Axel W. ; Gautier, Thierry ; Roch, Jean-Louis
Author_Institution :
Lab. ID-IMAG, Montbonnot Saint-Martin
Abstract :
In this paper, a new checkpoint/recovery protocol called theft-induced checkpointing is defined for dataflow computations in large heterogeneous environments. The protocol is especially useful in massively parallel multi-threaded computations, as found in cluster or grid computing, and utilizes the principle of work-stealing to distribute work. By basing the state of executions on a macro dataflow graph, the protocol shows extreme flexibility with respect to rollback. Specifically, it allows local rollback in dynamic heterogeneous systems, even under a different number of processors and processes. To maximize run-time efficiency, the overhead associated with checkpointing is shifted to the rollback operations whenever possible. Experimental results show that the induced overhead is very small.
Keywords :
checkpointing; data flow computing; data flow graphs; fault tolerance; multi-threading; protocols; checkpoint protocol; dynamic heterogeneous systems; large heterogeneous environments; macrodataflow graph; reconfigurable dataflow applications; recovery protocol; theft-induced checkpointing; Application software; Checkpointing; Computer applications; Computer networks; Fault tolerance; Grid computing; Parallel architectures; Protocols; Redundancy; Runtime
Conference_Title :
Electro Information Technology, 2005 IEEE International Conference on
Conference_Location :
Lincoln, NE
Print_ISBN :
0-7803-9232-9
DOI :
10.1109/EIT.2005.1626998