DocumentCode :
2788571
Title :
Tiresias: Black-Box Failure Prediction in Distributed Systems
Author :
Williams, Andrew W. ; Pertet, Soila M. ; Narasimhan, Priya
Author_Institution :
Dept. of Electr. & Comput. Eng., Carnegie Mellon Univ., Pittsburgh, PA
fYear :
2007
fDate :
26-30 March 2007
Firstpage :
1
Lastpage :
8
Abstract :
Faults in distributed systems can result in errors that manifest in several ways, potentially even in parts of the system that are not collocated with the root cause. These manifestations often appear as deviations (or "errors") in performance metrics. By transparently gathering various node-level and system-level performance metrics and then identifying escalating anomalous behavior in them, the Tiresias system makes black-box failure prediction possible. Through trend analysis of performance metrics, Tiresias provides a window of opportunity (look-ahead time) for system recovery prior to impending crash failures. We empirically validate the heuristic rules of the Tiresias system by analyzing fault-free and faulty performance data from a replicated middleware-based system.
Keywords :
distributed processing; failure analysis; fault tolerant computing; system recovery; black-box failure prediction; distributed system; performance metric; replicated middleware-based system; Computer crashes; Computer errors; Degradation; Failure analysis; Measurement; Middleware; Network servers; Performance analysis; Telecommunication traffic; Testing
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International
Conference_Location :
Long Beach, CA
Print_ISBN :
1-4244-0910-1
Electronic_ISBN :
1-4244-0910-1
Type :
conf
DOI :
10.1109/IPDPS.2007.370345
Filename :
4228073