DocumentCode :
2539415
Title :
Automatic model-driven recovery in distributed systems
Author :
Joshi, Kaustubh R. ; Hiltunen, Matti A. ; Sanders, William H. ; Schlichting, Richard D.
Author_Institution :
Coordinated Sci. Lab., Illinois Univ., Urbana, IL, USA
fYear :
2005
fDate :
26-28 Oct. 2005
Firstpage :
25
Lastpage :
36
Abstract :
Automatic system monitoring and recovery has the potential to provide a low-cost solution for high availability. However, automating recovery is difficult in practice because of the challenge of accurate fault diagnosis in the presence of low coverage, poor localization ability, and false positives that are inherent in many widely used monitoring techniques. In this paper, we present a holistic model-based approach that overcomes these challenges and enables automatic recovery in distributed systems. To do so, it uses theoretically sound techniques, including Bayesian estimation and Markov decision theory, to provide controllers that choose good, if not optimal, recovery actions according to a user-defined optimization criterion. By combining monitoring and recovery, the approach realizes benefits that could not have been obtained by using them in isolation. We present two recovery algorithms with complementary properties and trade-offs, and validate them (through simulation) by fault injection on a realistic e-commerce system.
Keywords :
Bayes methods; Markov processes; decision theory; distributed processing; fault diagnosis; fault tolerant computing; optimisation; system monitoring; system recovery; Bayesian estimation; Markov decision theory; automatic model-driven recovery; automatic system monitoring; automatic system recovery; distributed system; e-commerce system; fault diagnosis; fault injection; optimization; Application software; Availability; Bayesian methods; Computerized monitoring; Condition monitoring; Decision theory; Fault diagnosis; Optimal control; Redundancy; Testing;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Reliable Distributed Systems, 2005. SRDS 2005. 24th IEEE Symposium on
Print_ISBN :
0-7695-2463-X
Type :
conf
DOI :
10.1109/RELDIS.2005.11
Filename :
1541182