DocumentCode
3448352
Title
Quantifying effectiveness of failure prediction and response in HPC systems: Methodology and example
Author
Brandt, James ; Chen, Frank ; De Sapio, Vincent ; Gentile, Ann ; Mayo, Jackson ; Pébay, Philippe ; Roe, Diana ; Thompson, David ; Wong, Matthew
Author_Institution
Sandia Nat. Labs., Livermore, CA, USA
fYear
2010
fDate
June 28 2010-July 1 2010
Firstpage
2
Lastpage
7
Abstract
Effective failure prediction and mitigation strategies in high-performance computing systems could provide huge gains in resilience of tightly coupled large-scale scientific codes. These gains would come from prediction-directed process migration and resource servicing, intelligent resource allocation, and checkpointing driven by failure predictors rather than at regular intervals based on nominal mean time to failure. Given probabilistic associations of outlier behavior in hardware-related metrics with eventual failure in hardware, system software, and/or applications, this paper explores approaches for quantifying the effects of prediction and mitigation strategies and demonstrates these using actual production system data. We describe context-relevant methodologies for determining the accuracy and cost-benefit of predictors.
Keywords
distributed processing; fault tolerant computing; statistical analysis; HPC systems; checkpointing; failure prediction; failure response; hardware-related metrics; high-performance computing systems; mitigation strategies; outlier behavior; prediction strategies; prediction-directed process migration; resource allocation; resource servicing; Checkpointing; Costs; Failure analysis; Hardware; Laboratories; Large-scale systems; Production systems; Resilience; Resource management; System software
fLanguage
English
Publisher
IEEE
Conference_Titel
Dependable Systems and Networks Workshops (DSN-W), 2010 International Conference on
Conference_Location
Chicago, IL
Print_ISBN
978-1-4244-7729-6
Electronic_ISBN
978-1-4244-7728-9
Type
conf
DOI
10.1109/DSNW.2010.5542629
Filename
5542629