Title :
Fault Tolerance in P2P-Grid Environments
Author :
Huan Wang ; Hidenori Nakazato
Author_Institution :
Grad. Sch. of Global Inf. & Telecommun. Studies, Waseda Univ., Tokyo, Japan
Abstract :
A P2P-Grid system provides a framework that converges Grid and peer-to-peer networks to deploy large-scale distributed applications. However, working nodes with heterogeneous properties can freely join and leave in the middle of their computation. Because nodes participate dynamically, arriving and departing at any time at their users' discretion, the network topology changes continually and execution failures are more common than in other systems. For this reason, failure detection mechanisms and fault tolerance functions, typically an integral part of a P2P-Grid system, have been well studied. Our research aims to address the highly dynamic nature of P2P-Grid systems by drawing on the node lifetime statistics reported in previous research. We propose a checkpointing-and-recovery architecture that lets applications restart as soon as possible on P2P-Grid systems. Because failure detection is a necessary prerequisite to fault tolerance and fault recovery in a P2P-Grid system, we also investigate how the design of various failure detection algorithms affects their performance, measured by average node failure detection time. The evaluation shows that our checkpointing-and-restart paradigm and failure detection algorithm enable high reliability and performance even with high node departure rates.
Keywords :
checkpointing; grid computing; peer-to-peer computing; software fault tolerance; P2P-grid environments; checkpointing-and-recovery architecture; common execution failures; failure detection mechanisms; fault recovery; fault tolerance; heterogeneous properties; high node departure; large-scale distributed applications; peer-to-peer network; restart paradigm; user decision; working nodes; Checkpointing; Computational modeling; Detection algorithms; Fault tolerance; Fault tolerant systems; Peer to peer computing; P2P-Grid; failure detection; failure recovery; fault tolerance;
Conference_Title :
Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2012 IEEE 26th International
Conference_Location :
Shanghai
Print_ISBN :
978-1-4673-0974-5
DOI :
10.1109/IPDPSW.2012.308