DocumentCode :
2923088
Title :
Latent fault detection in large scale services
Author :
Gabel, Moshe ; Schuster, Assaf ; Gilad-Bachrach, Ran ; Bjørner, Nikolaj
Author_Institution :
Dept. of Comput. Sci., Technion - Israel Inst. of Technol., Haifa, Israel
fYear :
2012
fDate :
25-28 June 2012
Firstpage :
1
Lastpage :
12
Abstract :
Unexpected machine failures, with their resulting service outages and data loss, pose challenges to datacenter management. Existing failure detection techniques rely on domain knowledge, precious (often unavailable) training data, textual console logs, or intrusive service modifications. We hypothesize that many machine failures are not a result of abrupt changes but rather a result of a long period of degraded performance. This is confirmed in our experiments, in which over 20% of machine failures were preceded by such latent faults. We propose a proactive approach for failure prevention. We present a novel framework for statistical latent fault detection using only ordinary machine counters collected as standard practice. We demonstrate three detection methods within this framework. Derived tests are domain-independent and unsupervised, require neither background information nor tuning, and scale to very large services. We prove strong guarantees on the false positive rates of our tests.
Keywords :
Web services; learning (artificial intelligence); software fault tolerance; statistical analysis; data loss; datacenter management; distributed computing; failure detection techniques; failure prevention; large scale services; service outages; statistical latent fault detection; statistical learning; unexpected machine failures; Fault detection; Hardware; Monitoring; Radiation detectors; Support vector machines; Tuning; Vectors
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Dependable Systems and Networks (DSN), 2012 42nd Annual IEEE/IFIP International Conference on
Conference_Location :
Boston, MA
ISSN :
1530-0889
Print_ISBN :
978-1-4673-1624-8
Electronic_ISBN :
1530-0889
Type :
conf
DOI :
10.1109/DSN.2012.6263932
Filename :
6263932