DocumentCode :
3062300
Title :
Proficiency Metrics for Failure Prediction in High Performance Computing
Author :
Taerat, Narate ; Leangsuksun, Chokchai Box ; Chandler, Clayton ; Naksinehaboon, Nichamon
Author_Institution :
Coll. of Eng. & Sci., Louisiana Tech Univ., Ruston, LA, USA
fYear :
2010
fDate :
6-9 Sept. 2010
Firstpage :
491
Lastpage :
498
Abstract :
The number of failures occurring in large-scale high performance computing (HPC) systems is increasing significantly due to the large number of physical components in such systems. Fault tolerance (FT) mechanisms help parallel applications mitigate the impact of failures. However, using such mechanisms incurs additional overhead. Failure prediction is therefore needed to use FT mechanisms judiciously, and the proficiency of a failure predictor determines the efficiency of FT mechanism utilization. The proficiency of a failure predictor in HPC is usually expressed through well-known error measurements, e.g. MSE, MAD, precision and recall, where lower error implies greater proficiency. In this manuscript, we propose to view prediction proficiency from another aspect: lost computing time. We then discuss the insufficiency of error measurements as HPC failure prediction proficiency metrics from the aspect of lost computing time, and propose novel metrics that address these issues.
Keywords :
checkpointing; fault tolerant computing; multiprocessing systems; failure prediction; fault tolerance mechanism; large scale high performance computing system; proficiency metrics; Computational modeling; Loss measurement; Measurement uncertainty; Predictive models; Time measurement; Time series analysis;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Parallel and Distributed Processing with Applications (ISPA), 2010 International Symposium on
Conference_Location :
Taipei
Print_ISBN :
978-1-4244-8095-1
Electronic_ISBN :
978-0-7695-4190-7
Type :
conf
DOI :
10.1109/ISPA.2010.84
Filename :
5634371