Title :
ProvErr: System Level Statistical Fault Diagnosis Using Dependency Model
Author :
Chen, Peng; Plale, Beth A.
Author_Institution :
Sch. of Inf. & Comput., Indiana Univ., Bloomington, IN, USA
Abstract :
Large-scale distributed systems are difficult to debug in the event of failure. Yet rapid fault diagnosis that pinpoints failures to the component level is critical to fast recovery. We introduce a statistical approach to fault diagnosis that uses a dependency graph of execution to automatically discover the most probable fault cause(s) at the component level (either a software or a hardware resource). This approach leverages engineers' high-level understanding of the system and requires far less information than existing methods. It also uses dependency information to eliminate redundant causes while retaining co-causes. Experiments using Apache Pig show that our approach performs well and robustly in diagnosing software bugs and resource shortages, and scales nearly linearly as system size increases.
Keywords :
fault diagnosis; system recovery; Apache Pig; ProvErr; dependency model; execution dependency graph; system level statistical fault diagnosis; Buildings; Computer bugs; Fault diagnosis; Hardware; Knowledge engineering; Runtime; Software
Conference_Title :
Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM International Symposium on
Conference_Location :
Shenzhen
DOI :
10.1109/CCGrid.2015.86