Abstract:
Summary form only given. Our ability to design and deploy large, complex systems is outpacing our ability to understand their behavior. How do we detect and recover from "heisenbugs", which account for up to 40% of failures in complex Internet systems, without extensive application-specific coding? Which users were affected, and for how long? How do we diagnose and correct problems caused by configuration or operator errors? Although these questions are posed at a high level of abstraction, usually all we can measure directly are low-level behaviors, which is analogous to driving a car while looking through a magnifying glass. Machine learning can bridge this gap with techniques that learn "baseline" models automatically or semi-automatically, making it possible to characterize and monitor systems whose structure is not well understood a priori. This paper discusses initial successes and future challenges in using machine learning for failure detection and diagnosis, configuration troubleshooting, attribution (identifying which low-level properties appear to be correlated with an observed high-level effect, such as decreased performance), and failure forecasting.
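As a purely illustrative sketch of the "baseline" idea, and not the method used in the paper, the simplest possible baseline is a mean and standard deviation learned from measurements taken during normal operation, with later observations flagged when they deviate too far from it. The function names, the latency samples, and the threshold of 3 standard deviations are all assumptions made for this example.

```python
from statistics import mean, stdev

def learn_baseline(samples):
    """Learn a trivial baseline: mean and sample standard deviation.

    Hypothetical helper; real systems would use far richer models.
    """
    return mean(samples), stdev(samples)

def is_anomalous(value, baseline, threshold=3.0):
    """Flag a value whose z-score against the baseline exceeds the threshold.

    The threshold of 3.0 is an assumed, illustrative choice.
    """
    mu, sigma = baseline
    if sigma == 0:
        # Degenerate baseline: any departure from the constant is anomalous.
        return value != mu
    return abs(value - mu) / sigma > threshold

# Made-up low-level measurements (request latency in ms) from normal operation.
normal_latencies_ms = [102, 98, 101, 97, 103, 99, 100, 100]
baseline = learn_baseline(normal_latencies_ms)

# Check two new observations against the learned baseline.
flags = [is_anomalous(v, baseline) for v in [101, 250]]
print(flags)
```

The point of the sketch is only that the model is learned from observed low-level behavior rather than written by hand for a specific application.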
Keywords:
learning (artificial intelligence); program diagnostics; software reliability; statistical analysis; configuration troubleshooting; machine learning; software dependability; software failure detection; software failure diagnosis; software failure forecasting; statistical techniques; system monitoring; Bridges; Computerized monitoring; Condition monitoring; Error correction; Glass; Internet; Machine learning