Regional Information Center for Science and Technology - Taming of the Shrew: Modeling the Normal and Faulty Behaviour of Large-scale HPC Systems

DocumentCode :

2959797

Title :

Taming of the Shrew: Modeling the Normal and Faulty Behaviour of Large-scale HPC Systems

Author :

Gainaru, Ana ; Cappello, Franck ; Kramer, William

Author_Institution :

Comput. Sci. Dept., UIUC, Urbana, IL, USA

fYear :

2012

fDate :

21-25 May 2012

Firstpage :

1168

Lastpage :

1179

Abstract :

HPC systems are complex machines that generate a huge volume of system state data called "events". Events are generated without following a general, consistent rule, and different hardware and software components of such systems have different failure rates. Distinguishing between normal system behaviour and faulty situations relies on event analysis. Being able to quickly detect deviations from normality is essential for system administration and is the foundation of fault prediction. As HPC systems continue to grow in size and complexity, mining event flows becomes more challenging, and with the upcoming 10 Petaflop systems there is a lot of interest in this topic. Current event mining approaches do not take into consideration the specific behaviour of each type of event and, as a consequence, fail to analyze events according to their characteristics. In this paper we propose a novel way of characterizing the normal and faulty behaviour of the system by using signal analysis concepts. All the analysis modules make up ELSA (Event Log Signal Analyzer), a toolkit whose purpose is to model the normal flow of each state event during an HPC system's lifetime and how that flow is affected when a failure hits the system. We show that these extracted models provide an accurate view of the system output, which improves the effectiveness of proactive fault tolerance algorithms. Specifically, we implemented a filtering algorithm and a short-term fault prediction methodology based on the extracted model and tested them against real failure traces from a large-scale system. We show that by analyzing each event according to its specific behaviour, we get a more realistic overview of the entire system.

Keywords :

data mining; failure analysis; large-scale systems; parallel processing; prediction theory; program diagnostics; signal processing; software fault tolerance; 10 Petaflop systems; ELSA; complex machines; event analysis; event flows mining; event log signal analyzer; event mining approaches; events generation; failure rates; fault tolerance algorithms; faulty behaviour; filtering algorithm; hardware components; large-scale HPC systems; normal system behaviour; real failure tracing; short-term fault prediction methodology; signal analysis; software components; system administration; system state data generation; Analytical models; Correlation; Data mining; Large-scale systems; Prediction algorithms; Predictive models; Signal analysis; fault detection; fault tolerance; large-scale HPC systems; signal analysis;

fLanguage :

English

Publisher :

IEEE

Conference_Titel :

Parallel & Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International

Conference_Location :

Shanghai

ISSN :

1530-2075

Print_ISBN :

978-1-4673-0975-2

Type :

conf

DOI :

10.1109/IPDPS.2012.107

Filename :

6267920

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2959797