Title :
Predicting Node Failure in High Performance Computing Systems from Failure and Usage Logs
Author :
Nakka, Nithin ; Agrawal, Ankit ; Choudhary, Alok
Author_Institution :
Coordinated Sci. Lab., Univ. of Illinois at Urbana-Champaign, Urbana, IL, USA
Abstract :
In this paper, we apply data mining classification schemes to predict failures in a high performance computing system. Failure and usage logs collected on supercomputing clusters at Los Alamos National Laboratory (LANL) were used to extract failure instances. For each failure instance, past and future failure information is accumulated: time of usage, system idle time, time of unavailability, time since last failure, and time to next failure. We performed two separate analyses, with and without classifying the failures by root cause. On this data, we applied several popular decision tree classifiers to predict whether a failure would occur within 1 hour. Our experiments show that the prediction system predicts failures with precision of up to 73% and recall of about 80%. We also observed that using the usage data along with the failure data improved prediction accuracy.
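The labeling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names (`build_instances`, `precision_recall`), the record layout, and the choice to treat "within 1 hour" as an inclusive bound are all assumptions made for illustration. It derives time since last failure and time to next failure from one node's failure timestamps, labels each instance by whether the next failure falls inside the 1-hour window, and computes the precision and recall metrics the abstract reports.

```python
from datetime import datetime, timedelta

# Prediction window taken from the abstract; treating it as inclusive is an assumption.
HORIZON = timedelta(hours=1)


def build_instances(failure_times):
    """Turn one node's failure timestamps into labeled instances (hypothetical layout)."""
    times = sorted(failure_times)
    instances = []
    for i, t in enumerate(times):
        # Gap to the previous failure; None for the first failure in the log.
        since_last = t - times[i - 1] if i > 0 else None
        # Gap to the following failure; None for the last failure in the log.
        to_next = times[i + 1] - t if i + 1 < len(times) else None
        instances.append({
            "time": t,
            "since_last": since_last,
            "to_next": to_next,
            # Class label: does another failure follow within the window?
            "fails_within_horizon": to_next is not None and to_next <= HORIZON,
        })
    return instances


def precision_recall(actual, predicted):
    """Precision and recall for boolean labels, with 0.0 when undefined."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    day = datetime(2011, 1, 1)
    # Made-up timestamps, purely for demonstration.
    log = [day, day + timedelta(minutes=30),
           day + timedelta(hours=3), day + timedelta(hours=3, minutes=45)]
    labels = [inst["fails_within_horizon"] for inst in build_instances(log)]
    print(labels)
    print(precision_recall(labels, [True, True, False, False]))
```

In a fuller version, the usage-derived attributes (time of usage, idle time, unavailability time) would be added to each instance before it is passed to a decision tree classifier; time to next failure serves only to derive the label here, since it is not known at prediction time.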
Keywords :
data mining; decision trees; pattern classification; system recovery; data mining classification; decision tree classifier; failure data log; failure information; high performance computing system; node failure; prediction system; supercomputing cluster; system idle time; unavailability time; usage data log; Data mining; Databases; Hardware; Humans; Maintenance engineering; Program processors;
Conference_Title :
Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), 2011 IEEE International Symposium on
Conference_Location :
Shanghai
Print_ISBN :
978-1-61284-425-1
Electronic_ISBN :
1530-2075
DOI :
10.1109/IPDPS.2011.310