Title :
Scientific Computing Autonomic Reliability Framework
Author :
Dubey, Abhishek ; Neema, Sandeep ; Kowalkowski, Jim ; Singh, Amitoj
Author_Institution :
Inst. for Software Integrated Syst., Vanderbilt Univ., Nashville, TN
Abstract :
Large scientific computing clusters require a distributed dependability subsystem that provides fault isolation and recovery and can learn and predict failures, in order to improve the reliability of scientific workflows. In this paper, we outline the key ideas in the design of a Scientific Computing Autonomic Reliability Framework (SCARF) for the large computing clusters used in the Lattice Quantum Chromodynamics project at Fermilab.
Keywords :
fault diagnosis; fault tolerant computing; natural sciences computing; reliability; workflow management software; distributed dependability subsystem; fault isolation and recovery; scientific computing autonomic reliability framework; scientific workflows; Centralized control; Computer architecture; Condition monitoring; Engines; Environmental management; Fault diagnosis; Quantum computing; Resource management; Scientific computing; Software systems; Cluster Computing; Reliability; Software fault-tolerance; Workflows;
Conference_Title :
eScience, 2008. eScience '08. IEEE Fourth International Conference on
Conference_Location :
Indianapolis, IN
Print_ISBN :
978-1-4244-3380-3
Electronic_ISBN :
978-0-7695-3535-7
DOI :
10.1109/eScience.2008.113