DocumentCode :
3448304
Title :
A practical failure prediction with location and lead time for Blue Gene/P
Author :
Zheng, Ziming ; Lan, Zhiling ; Gupta, Rinku ; Coghlan, Susan ; Beckman, Peter
Author_Institution :
Dept. of Comput. Sci., Illinois Inst. of Technol., Chicago, IL, USA
fYear :
2010
fDate :
June 28 2010-July 1 2010
Firstpage :
15
Lastpage :
22
Abstract :
Analyzing, understanding, and predicting failures is of paramount importance for achieving effective fault management. While various fault prediction methods have been studied in the past, many of them are not practical for use in real systems. In particular, they fail to address two crucial issues: one is to provide location information (i.e., the components on which the failure is expected to occur) and the other is to provide sufficient lead time (i.e., the time interval preceding the time of failure occurrence). In this paper, we first refine the widely used metrics for evaluating prediction accuracy by including location as well as lead time. We then present a practical failure prediction mechanism for IBM Blue Gene systems. A genetic-algorithm-based method is employed, which takes into consideration the location and the lead time for failure prediction. We demonstrate the effectiveness of this mechanism by means of real failure logs and job logs collected from the IBM Blue Gene/P system at Argonne National Laboratory. Our experiments show that the presented method can significantly improve fault management (e.g., reducing service unit loss by up to 52.4%) by incorporating location and lead time information in the prediction.
Keywords :
computer networks; fault tolerant computing; genetic algorithms; prediction theory; Argonne National Laboratory; IBM Blue Gene systems; failure logs; fault management; genetic algorithm; job logs; practical failure prediction; Accuracy; Checkpointing; Computer networks; Computer science; Failure analysis; Genetic algorithms; Laboratories; Lead time reduction; Mathematics; Prediction methods
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Dependable Systems and Networks Workshops (DSN-W), 2010 International Conference on
Conference_Location :
Chicago, IL
Print_ISBN :
978-1-4244-7729-6
Electronic_ISBN :
978-1-4244-7728-9
Type :
conf
DOI :
10.1109/DSNW.2010.5542627
Filename :
5542627