Title : 
Theft-induced checkpointing for reconfigurable dataflow applications
         
        
            Author : 
Jafar, Syed ; Krings, Axel W. ; Gautier, Thierry ; Roch, Jean-Louis
         
        
            Author_Institution : 
Lab. ID-IMAG, Montbonnot Saint-Martin
         
        
        
        
        
            Abstract : 
This paper defines a new checkpoint/recovery protocol, called theft-induced checkpointing, for dataflow computations in large heterogeneous environments. The protocol is especially useful in massively parallel multi-threaded computations, as found in cluster or grid computing, and uses the principle of work-stealing to distribute work. Because the state of an execution is based on a macro dataflow graph, the protocol is extremely flexible with respect to rollback. Specifically, it allows local rollback in dynamic heterogeneous systems, even when the number of processors and processes differs from that of the original execution. To maximize run-time efficiency, the overhead associated with checkpointing is shifted to the rollback operations whenever possible. Experimental results show that the induced overhead is very small.
         
        
            Keywords : 
checkpointing; data flow computing; data flow graphs; fault tolerance; multi-threading; protocols; checkpoint protocol; dynamic heterogeneous systems; large heterogeneous environments; macrodataflow graph; reconfigurable dataflow applications; recovery protocol; theft-induced checkpointing; Application software; Checkpointing; Computer applications; Computer networks; Fault tolerance; Grid computing; Parallel architectures; Protocols; Redundancy; Runtime
         
        
        
        
            Conference_Title : 
Electro Information Technology, 2005 IEEE International Conference on
         
        
            Conference_Location : 
Lincoln, NE
         
        
            Print_ISBN : 
0-7803-9232-9
         
        
        
            DOI : 
10.1109/EIT.2005.1626998