DocumentCode :
3507437
Title :
A Solution for Fault-Tolerance Based on Adaptive Replication in MonALISA
Author :
Costan, Alexandru ; Andreica, Mugurel Ionut ; Cristea, Valentin ; Grigoras, Costin
Author_Institution :
Comput. Sci. Dept., Univ. Politeh. of Bucharest, Bucharest, Romania
fYear :
2010
fDate :
4-6 Nov. 2010
Firstpage :
375
Lastpage :
380
Abstract :
The domains of usage of large-scale distributed systems have been extending during the past years from scientific to commercial applications. Together with the extension of the application domains, new requirements have emerged for large-scale distributed systems. Among these, fault tolerance is needed by more and more modern distributed applications, not only by the critical ones. In this paper we present a solution aiming at fault-tolerant monitoring of the distributed systems within the MonALISA framework. Our approach uses replication and guarantees that all processing replicas achieve state consistency, both in the absence of failures and after failure recovery. We achieve consistency in the former case by implementing a module that ensures that the order of monitoring tuples is the same at all the replicas. To achieve consistency after failure recovery, we rely on checkpointing techniques. We address the optimization problem of the replication architecture by dynamically monitoring and estimating inter-replica link throughputs and real-time replica status. We demonstrate the strengths of our solution using the MonALISA monitoring application in a distributed environment. Our tests show that the proposed approach outperforms previous solutions in terms of latency and that it uses system resources efficiently by carefully updating replicas, while keeping overhead very low.
Keywords :
distributed processing; optimisation; software architecture; software fault tolerance; MonALISA; adaptive replication; failure recovery; fault tolerant monitoring; fault-tolerance; inter-replica link throughputs; large-scale distributed systems; optimization problem; real-time replica status; replication architecture; Grid computing; distributed systems; fault tolerance; monitoring; replication;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC), 2010 International Conference on
Conference_Location :
Fukuoka
Print_ISBN :
978-1-4244-8538-3
Electronic_ISBN :
978-0-7695-4237-9
Type :
conf
DOI :
10.1109/3PGCIC.2010.63
Filename :
5662762