Author :
Tian, Guanhua ; Meng, Dan ; Li, Yong
Author_Institution :
Inst. of Comput. Technol., Chinese Acad. of Sci., Beijing, China
Abstract :
Locating and diagnosing performance faults in distributed systems is crucial but challenging. Distributed systems are increasingly complex, full of correlations and dependencies, and highly dynamic. Together, these properties make traditional approaches prone to high false-alarm rates. In this paper, we propose a novel system modeling technique that encodes components' dynamic dependencies and behavioral characteristics into a system meta-model and uses that meta-model as a unifying framework for deploying component sub-models. We propose an automatic analysis approach that distills request path signatures, the essential information about components' dynamic behavior, from request travel paths, uses them to induce the meta-model as a Bayesian network, and then uses the model for fault location and diagnosis. We conduct fault-injection experiments with RUBiS, a TPC-W-like benchmark that simulates eBay.com. The results indicate that our modeling approach provides effective problem diagnosis; that is, the Bayesian network technique is effective for detecting and pinpointing faults in a request-tracing context. Moreover, the meta-model induced from request paths provides effective guidance for learning statistical correlations among metrics across the system, which effectively avoids 'false alarms' in fault pinpointing. As a case study, we construct a proactive recovery framework that integrates our system modeling technique with software rejuvenation to guarantee the system's quality of service.
Keywords :
belief networks; distributed processing; software fault tolerance; Bayesian network technique; RUBiS benchmark; distributed systems; performance fault diagnosis; proactive recovery framework; request path driven model; Bayesian methods; Correlation; Fault detection; Fault location; Measurement; Modeling; Probability distribution