DocumentCode
2108338
Title
Automatic Problem Localization via Multi-dimensional Metric Profiling
Author
Laguna, Ignacio ; Mitra, Subhasish ; Arshad, Fahad A. ; Theera-Ampornpunt, Nawanol ; Zongyang Zhu ; Bagchi, Saurabh ; Midkiff, Samuel P. ; Kistler, Mike ; Gheith, Ahmed
fYear
2013
fDate
Sept. 30 2013-Oct. 3 2013
Firstpage
121
Lastpage
132
Abstract
Debugging today's large-scale distributed applications is complex. Traditional debugging techniques such as breakpoint-based debugging and performance profiling require a substantial amount of domain knowledge and do not automate the process of locating bugs and performance anomalies. We present Orion, a framework to automate the problem-localization process in distributed applications. From a large set of metrics, Orion intelligently chooses important metrics and models the application's runtime behavior through pairwise correlations of those metrics in the system, within multiple non-overlapping time windows. When correlations deviate from those of a learned correct model due to a bug, our analysis pinpoints the metrics and code regions (class and method within it) that are most likely associated with the failure. We demonstrate our framework with several real-world failure cases in distributed applications such as HBase, Hadoop DFS, a campus-wide Java application, and a regression testing framework from IBM. Our results show that Orion is able to pinpoint the metrics and code regions that developers need to concentrate on to fix the failures.
Keywords
distributed processing; program debugging; software performance evaluation; statistical testing; HBase; Hadoop DFS; IBM; ORION; automatic problem localization; breakpoint-based debugging; bug locating process automation; campus-wide Java application; debugging techniques; large-scale distributed applications; multidimensional metric profiling; nonoverlapping time windows; performance anomalies; problem-localization process; regression testing framework; Algorithm design and analysis; Computer bugs; Correlation; Debugging; Hardware; Measurement; Principal component analysis; debugging aids; diagnostics; performance metrics; tracing;
fLanguage
English
Publisher
IEEE
Conference_Titel
Reliable Distributed Systems (SRDS), 2013 IEEE 32nd International Symposium on
Conference_Location
Braga
Type
conf
DOI
10.1109/SRDS.2013.21
Filename
6656268