Title :
Ensembles of models for automated diagnosis of system performance problems
Author :
Zhang, Steve ; Cohen, Ira ; Goldszmidt, Moises ; Symons, Julie ; Fox, Armando
Author_Institution :
Stanford Univ., CA, USA
Date :
28 June-1 July 2005
Abstract :
Violations of service level objectives (SLOs) in Internet services are urgent conditions requiring immediate attention. Previously we explored (I. Cohen et al., 2004) an approach for identifying which low-level system properties were correlated with high-level SLO violations (the metric attribution problem). The approach is based on automatically inducing models from data using pattern recognition and probability modeling techniques. In this paper we extend that approach to adapt to changing workloads and external disturbances by maintaining an ensemble of probabilistic models, adding new models when the existing ones do not accurately capture current system behavior. Using realistic workloads on an implemented prototype system, we show that the ensemble of models accurately captures the performance behavior of the system under changing workloads and conditions. We fuse information from the models in the ensemble to identify likely causes of a performance problem, with results comparable to those produced by an oracle that continuously changes the model based on advance knowledge of the workload. The cost of inducing new models and managing the ensemble is negligible, making our approach both immediately practical and theoretically appealing.
Keywords :
Internet; belief networks; fault diagnosis; fault tolerant computing; pattern recognition; probability; Bayesian model management; Internet services; automated diagnosis; pattern recognition; probability modeling techniques; self-healing systems; self-monitoring systems; service level objectives; statistical induction; system performance; Availability; Bayesian methods; Delay; Hardware; Pattern recognition; Sensor phenomena and characterization; Sensor systems; System performance; Web and internet services; Web server; Automated diagnosis; self-healing and self-monitoring systems; statistical induction and Bayesian Model Management
Conference_Title :
Dependable Systems and Networks, 2005. DSN 2005. Proceedings. International Conference on
Print_ISBN :
0-7695-2282-3
DOI :
10.1109/DSN.2005.44