DocumentCode :
2194898
Title :
Scientific Computing Autonomic Reliability Framework
Author :
Dubey, Abhishek ; Neema, Sandeep ; Kowalkowski, Jim ; Singh, Amitoj
Author_Institution :
Inst. for Software Integrated Syst., Vanderbilt Univ., Nashville, TN
fYear :
2008
fDate :
7-12 Dec. 2008
Firstpage :
352
Lastpage :
353
Abstract :
Large scientific computing clusters require a distributed dependability subsystem that can provide fault isolation and recovery and is capable of learning and predicting failures, to improve the reliability of scientific workflows. In this paper, we outline the key ideas in the design of a Scientific Computing Autonomic Reliability Framework (SCARF) for large computing clusters used in the Lattice Quantum Chromodynamics project at Fermilab.
Keywords :
fault diagnosis; fault tolerant computing; natural sciences computing; reliability; workflow management software; distributed dependability subsystem; fault isolation and recovery; scientific computing autonomic reliability framework; scientific workflows; Centralized control; Computer architecture; Condition monitoring; Engines; Environmental management; Fault diagnosis; Quantum computing; Resource management; Scientific computing; Software systems; Cluster Computing; Reliability; Software fault-tolerance; Workflows;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
eScience, 2008. eScience '08. IEEE Fourth International Conference on
Conference_Location :
Indianapolis, IN
Print_ISBN :
978-1-4244-3380-3
Electronic_ISBN :
978-0-7695-3535-7
Type :
conf
DOI :
10.1109/eScience.2008.113
Filename :
4736792