Title :
Online Fault and Anomaly Detection for Large-Scale Scientific Workflows
Author :
Samak, Taghrid ; Gunter, Daniel ; Goode, Monte ; Deelman, Ewa ; Juve, Gideon ; Mehta, Gaurang ; Silva, Fabio ; Vahi, Karan
Author_Institution :
Lawrence Berkeley Nat. Lab., Berkeley, CA, USA
Abstract :
Scientific workflows are an enabler of complex scientific analyses. Large-scale scientific workflows are executed on complex parallel and distributed resources, where many things can fail. Application scientists need to track the status of their workflows in real time, detect execution anomalies automatically, and perform troubleshooting -- without logging into remote nodes or searching through thousands of log files. As part of the NSF-funded Synthesized Tools for Archiving, Monitoring Performance and Enhanced DEbugging (STAMPEDE) project, we have developed an infrastructure to answer these needs by integrating detailed workflow and resource monitoring. On top of this infrastructure, we have developed analysis techniques for online detection of a wide variety of "hard" and "soft" types of failures. We use these detected failures to derive higher-level statistics about the status of the resources and the workflow as a whole. In this paper, we describe our techniques and evaluate their effectiveness in the context of real application logs.
Keywords :
distributed processing; fault tolerant computing; program debugging; statistical analysis; STAMPEDE; anomaly detection; distributed resources; large-scale scientific workflow; online fault detection; parallel resources; synthesized tools for archiving monitoring performance and enhanced debugging; Algorithm design and analysis; Broadband communication; Clustering algorithms; Databases; Measurement; Monitoring; Vectors; Scientific workflows; failure prediction; workflow management
Conference_Title :
High Performance Computing and Communications (HPCC), 2011 IEEE 13th International Conference on
Conference_Location :
Banff, AB
Print_ISBN :
978-1-4577-1564-8
Electronic_ISBN :
978-0-7695-4538-7
DOI :
10.1109/HPCC.2011.55