Title :
Fault-tolerance for macro dataflow parallel computations on grid
Author :
Jafar, Samir ; Roch, Jean-Louis
Author_Institution :
Lab. ID-IMAG, Montbonnot, France
Abstract :
We present a portable fault-tolerant mechanism for executing macro dataflow parallel programs on a large-scale, distributed, and heterogeneous grid that includes SMP nodes. The mechanism is based on portable checkpoint-rollback and supports both parallel programs with dependencies and the addition or removal of heterogeneous resources. We have implemented this mechanism on top of the Athapascan programming interface and present experimental results.
Keywords :
checkpointing; data flow computing; fault tolerant computing; grid computing; macros; multiprocessing systems; parallel languages; parallel programming; Athapascan programming interface; heterogeneous grid resources; macro dataflow parallel computation; portable fault tolerant mechanism; Computer architecture; Concurrent computing; Distributed computing; Fault tolerance; Grid computing; Large-scale systems; Parallel languages; Parallel processing; Portable computers; Resilience
Conference_Title :
Information and Communication Technologies: From Theory to Applications, 2004. Proceedings. 2004 International Conference on
Print_ISBN :
0-7803-8482-2
DOI :
10.1109/ICTTA.2004.1307897