DocumentCode :
1338878
Title :
Probabilistic Model-Driven Recovery in Distributed Systems
Author :
Joshi, Kaustubh R. ; Hiltunen, Matti A. ; Sanders, William H. ; Schlichting, Richard D.
Author_Institution :
AT&T Labs. Res., Florham Park, NJ, USA
Volume :
8
Issue :
6
Year :
2011
Firstpage :
913
Lastpage :
928
Abstract :
Automatic system monitoring and recovery has the potential to provide effective, low-cost ways to improve dependability in distributed software systems. However, automating recovery is challenging in practice because accurate fault diagnosis is hampered by monitoring tools and techniques that often have low fault coverage, poor fault localization, detection delays, and false positives. In this paper, we present a holistic model-based approach that overcomes these challenges and enables automatic recovery in distributed systems. To do so, it uses theoretically sound techniques including Bayesian estimation and Markov decision theory to provide controllers that choose good, if not optimal, recovery actions according to a user-defined optimization criterion. By combining monitoring and recovery, the approach realizes benefits that could not have been obtained by using them in isolation. We experimentally validate our framework by fault injection on realistic e-commerce systems.
Keywords :
Bayes methods; Markov processes; decision theory; distributed processing; fault diagnosis; software fault tolerance; system monitoring; system recovery; Bayesian estimation techniques; Markov decision theory; automatic system monitoring; automatic system recovery; detection delays; distributed software systems; fault diagnosis; fault injection; fault localization; probabilistic model-driven recovery; realistic e-commerce systems; Bayesian methods; Biomedical monitoring; Computer crashes; Diagnostic expert systems; Fault tolerance; Logic gates; Medical services; Bayesian; Fault tolerance; POMDP; adaptive systems; diagnosis; distributed systems; monitoring; recovery
Language :
English
Journal_Title :
Dependable and Secure Computing, IEEE Transactions on
Publisher :
IEEE
ISSN :
1545-5971
Type :
jour
DOI :
10.1109/TDSC.2010.45
Filename :
5590252