Title :
Incorporating Fault Tolerance with Replication on Very Large Scale Grids
Author :
Sundararajan, Elankovan ; Harwood, Aaron ; Kotagiri, Ramamohanarao
Author_Institution :
Univ. Kebangsaan Malaysia, Bangi
Abstract :
Providing fault tolerance for message-passing parallel applications in a distributed environment is the rule rather than the exception. Without fault tolerance, a single node failure can halt the whole computation, which must then be restarted from the beginning. However, introducing fault tolerance imposes some overhead on the achievable speedup. In this paper, we introduce a new technique, called replication with cross-over packets, to improve reliability and increase fault tolerance over Very Large Scale Grids (VLSG). This technique has the two-pronged effect of avoiding both a single point of failure and a single link of failure. We incorporate this new technique into the L-BSP model and show the possible speedup of parallel processes. We also derive the achievable speedup for some fundamental parallel algorithms using this technique.
Keywords :
fault tolerant computing; grid computing; message passing; parallel algorithms; L-BSP model; cross-over packets; distributed environment; fault tolerance; message passing parallel application; parallel process; very large scale grids; Application software; Concurrent computing; Distributed computing; Fault tolerance; Grid computing; Hardware; Large-scale systems; Parallel algorithms; Protocols; Wide area networks;
Conference_Title :
Parallel and Distributed Computing, Applications and Technologies, 2007. PDCAT '07. Eighth International Conference on
Conference_Location :
Adelaide, SA
Print_ISBN :
0-7695-3049-4
DOI :
10.1109/PDCAT.2007.26