Regional Information Center for Science and Technology (RICEST) - Predicting Node Failure in High Performance Computing Systems from Failure and Usage Logs

DocumentCode :

3145987

Title :

Predicting Node Failure in High Performance Computing Systems from Failure and Usage Logs

Author :

Nakka, Nithin ; Agrawal, Ankit ; Choudhary, Alok

Author_Institution :

Coordinated Sci. Lab., Univ. of Illinois at Urbana-Champaign, Urbana, IL, USA

fYear :

2011

fDate :

16-20 May 2011

Firstpage :

1557

Lastpage :

1566

Abstract :

In this paper, we apply data mining classification schemes to predict failures in a high performance computer system. Failure and usage data logs collected on supercomputing clusters at Los Alamos National Laboratory (LANL) were used to extract instances of failure information. For each failure instance, past and future failure information is accumulated: time of usage, system idle time, time of unavailability, time since last failure, and time to next failure. We performed two separate analyses, with and without classifying the failures based on their root cause. Based on this data, we applied several popular decision tree classifiers to predict whether a failure would occur within 1 hour. Our experiments show that the prediction system predicts failures with a precision of up to 73% and a recall of about 80%. We also observed that employing the usage data along with the failure data improved the accuracy of prediction.
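
The labeling step and the two reported metrics can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the dictionary keys, and the sample timestamps are hypothetical, and no decision tree is trained here. It only shows how "time since last failure", "time to next failure", and a "fails within 1 hour" label could be derived from a list of failure timestamps, and how precision and recall are computed from predictions.

```python
from datetime import datetime, timedelta


def build_instances(failure_times, horizon=timedelta(hours=1)):
    """Derive per-failure features and a binary label (hypothetical layout).

    The label is True when the next recorded failure occurs within `horizon`.
    """
    times = sorted(failure_times)
    instances = []
    for i, t in enumerate(times):
        # Gap to the previous failure; undefined for the first record.
        since_last = t - times[i - 1] if i > 0 else None
        # Gap to the following failure; undefined for the last record.
        to_next = times[i + 1] - t if i + 1 < len(times) else None
        instances.append({
            "time": t,
            "since_last": since_last,
            "to_next": to_next,
            "fails_within_horizon": to_next is not None and to_next <= horizon,
        })
    return instances


def precision_recall(y_true, y_pred):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up failure timestamps for a single node (illustration only).
sample = [
    datetime(2011, 5, 16, 0, 0),
    datetime(2011, 5, 16, 0, 30),
    datetime(2011, 5, 16, 3, 0),
    datetime(2011, 5, 16, 3, 45),
]
instances = build_instances(sample)
labels = [inst["fails_within_horizon"] for inst in instances]
```

In a real pipeline the usage-log features named in the abstract (time of usage, idle time, time of unavailability) would be joined to each instance before training a classifier. Time to next failure defines the label, so it would have to be excluded from the classifier inputs.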

Keywords :

data mining; decision trees; pattern classification; system recovery; data mining classification; decision tree classifier; failure data log; failure information; high performance computing system; node failure; prediction system; supercomputing cluster; system idle time; unavailability time; usage data log; Data mining; Databases; Hardware; Humans; Maintenance engineering; Program processors;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), 2011 IEEE International Symposium on

Conference_Location :

Shanghai

ISSN :

1530-2075

Print_ISBN :

978-1-61284-425-1

Electronic_ISBN :

1530-2075

Type :

conf

DOI :

10.1109/IPDPS.2011.310

Filename :

6009015

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3145987