Regional Information Center for Science and Technology - Online failure prediction for HPC resources using decentralized clustering

DocumentCode :

3591182

Title :

Online failure prediction for HPC resources using decentralized clustering

Author :

Pelaez, Alejandro ; Quiroz, Andres ; Browne, James C. ; Chuah, Edward ; Parashar, Manish

Author_Institution :

RDI2, State Univ. of New Jersey, Rutgers, NJ, USA

fYear :

2014

Firstpage :

Lastpage :

Abstract :

Ensuring high reliability of large-scale clusters is becoming more critical as the size of these machines continues to grow, since growth increases the complexity and number of interactions between nodes and thus results in a high failure frequency. For this reason, predicting node failures in order to prevent errors from happening in the first place has become extremely valuable. A common approach to failure prediction is to analyze traces of system events to find correlations between event types or anomalous event patterns and node failures, and to use the identified types or patterns as failure predictors at run-time. However, typical centralized solutions of this kind suffer from high transmission and processing overheads at very large scales. We present a solution to the problem of predicting compute node soft-lockups in large-scale clusters by using a decentralized online clustering algorithm (DOC) to detect anomalies in resource usage logs, which have been shown to correlate with particular types of node failures in supercomputer clusters. We demonstrate the effectiveness of this system by using the monitoring logs from the Ranger supercomputer at the Texas Advanced Computing Center. Experiments show that this approach achieves accuracy similar to that of related approaches while maintaining low RAM and bandwidth usage, with a runtime impact on currently running applications of less than 2%.
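The abstract's core idea, flagging resource-usage samples that fall outside the dense clusters formed by normal behavior, can be illustrated with a minimal single-process sketch. This is not the DOC algorithm from the paper, which is decentralized and whose details are not given in this record; it is a simple leader-style online clustering. All names (`OnlineClusterer`, `flag_anomalies`), the `radius` and `min_points` parameters, and the sample values are hypothetical.

```python
import math


class OnlineClusterer:
    """Leader-style clustering over a stream of numeric vectors.

    Illustrative stand-in only; NOT the paper's DOC algorithm.
    """

    def __init__(self, radius, min_points):
        self.radius = radius          # assumed: max distance at which a point joins a cluster
        self.min_points = min_points  # assumed: clusters smaller than this count as sparse
        self.centroids = []           # running mean of each cluster
        self.counts = []              # number of points absorbed by each cluster

    def observe(self, point):
        """Assign the point to its nearest cluster, or open a new one; return the cluster index."""
        best, best_dist = None, float("inf")
        for i, centroid in enumerate(self.centroids):
            d = math.dist(point, centroid)
            if d < best_dist:
                best, best_dist = i, d
        if best is not None and best_dist <= self.radius:
            # Incremental mean update keeps memory use constant per cluster.
            n = self.counts[best] + 1
            self.centroids[best] = [
                m + (x - m) / n for m, x in zip(self.centroids[best], point)
            ]
            self.counts[best] = n
            return best
        self.centroids.append(list(point))
        self.counts.append(1)
        return len(self.centroids) - 1

    def is_sparse(self, index):
        """A cluster with few members is treated as anomalous behavior."""
        return self.counts[index] < self.min_points


def flag_anomalies(samples, radius=0.2, min_points=3):
    """Return indices of samples whose cluster is still sparse once all samples are seen."""
    clusterer = OnlineClusterer(radius, min_points)
    assignments = [clusterer.observe(s) for s in samples]
    return [i for i, c in enumerate(assignments) if clusterer.is_sparse(c)]


# Hypothetical normalized (cpu, memory) usage samples for one node.
samples = [
    (0.30, 0.40), (0.32, 0.41), (0.29, 0.38),
    (0.31, 0.42), (0.95, 0.99), (0.30, 0.39),
]
print(flag_anomalies(samples))
```

In a decentralized setting each node would presumably run its share of this work locally and exchange only compact summaries, which is where the bandwidth savings described in the abstract would come from; that part is not modeled here.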

Keywords :

failure analysis; mainframes; parallel processing; pattern clustering; DOC algorithm; HPC resources; Ranger supercomputer; Texas Advanced Computing Center; bandwidth usage; compute node soft-lockup prediction; decentralized online clustering algorithm; large-scale cluster reliability; node failures; online failure prediction; supercomputer clusters; Accuracy; Clustering algorithms; Data mining; Distributed databases; Monitoring; Prediction algorithms; Supercomputers; Clustering; Failure prediction; HPC; Large-scale systems

fLanguage :

English

Publisher :

IEEE

Conference_Title :

High Performance Computing (HiPC), 2014 21st International Conference on

Print_ISBN :

978-1-4799-5975-4

Type :

conf

DOI :

10.1109/HiPC.2014.7116903

Filename :

7116903

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3591182