DocumentCode :
3448288
Title :
Accurate fault prediction of BlueGene/P RAS logs via geometric reduction
Author :
Thompson, Joshua ; Dreisigmeyer, David W. ; Jones, Terry ; Kirby, Michael ; Ladd, Joshua
Author_Institution :
Dept. of Math., Colorado State Univ., Fort Collins, CO, USA
fYear :
2010
fDate :
June 28 2010-July 1 2010
Firstpage :
8
Lastpage :
14
Abstract :
This investigation presents two distinct and novel approaches for predicting system failures on Oak Ridge National Laboratory's Blue Gene/P supercomputer. Each technique uses raw numeric and textual subsets of large data logs of physical system information, such as fan speeds and CPU temperatures. These data are used to develop models of the system capable of sensing anomalies, that is, deviations from nominal behavior. Each algorithm predicted anomalies reported in the event log in advance of their occurrence, and one algorithm did so without false positives. Both algorithms also predicted an anomaly that did not appear in the event log; the system administrator later confirmed that this fault had in fact occurred.
Keywords :
fault diagnosis; mainframes; system recovery; BlueGene/P; CPU temperatures; RAS logs; data logs; fan speeds; fault prediction; geometric reduction; numeric subsets; physical system information; supercomputer; system administrator; system failures prediction; textual subsets; Hardware; High performance computing; Information analysis; Laboratories; Machine learning; Mathematics; Prediction algorithms; Supercomputers; Switches; Telecommunication switching; MSET; NMF; fault prediction; high performance computing; resiliency;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Dependable Systems and Networks Workshops (DSN-W), 2010 International Conference on
Conference_Location :
Chicago, IL
Print_ISBN :
978-1-4244-7729-6
Electronic_ISBN :
978-1-4244-7728-9
Type :
conf
DOI :
10.1109/DSNW.2010.5542626
Filename :
5542626