DocumentCode :
2959797
Title :
Taming of the Shrew: Modeling the Normal and Faulty Behaviour of Large-scale HPC Systems
Author :
Gainaru, Ana ; Cappello, Franck ; Kramer, William
Author_Institution :
Comput. Sci. Dept., UIUC, Urbana, IL, USA
fYear :
2012
fDate :
21-25 May 2012
Firstpage :
1168
Lastpage :
1179
Abstract :
HPC systems are complex machines that generate a huge volume of system state data called "events". Events are generated without following a general, consistent rule, and different hardware and software components of such systems have different failure rates. Distinguishing between normal system behaviour and faulty situations relies on event analysis. Being able to quickly detect deviations from normality is essential for system administration and is the foundation of fault prediction. As HPC systems continue to grow in size and complexity, mining event flows becomes more challenging, and with the upcoming 10 Petaflop systems there is a lot of interest in this topic. Current event mining approaches do not take into consideration the specific behaviour of each type of event and, as a consequence, fail to analyze events according to their characteristics. In this paper we propose a novel way of characterizing the normal and faulty behaviour of the system by using signal analysis concepts. Together, the analysis modules form ELSA (Event Log Signal Analyzer), a toolkit whose purpose is to model the normal flow of each state event during an HPC system's lifetime and how that flow is affected when a failure hits the system. We show that these extracted models provide an accurate view of the system output, which improves the effectiveness of proactive fault tolerance algorithms. Specifically, we implemented a filtering algorithm and a short-term fault prediction methodology based on the extracted model and tested them against real failure traces from a large-scale system. We show that by analyzing each event according to its specific behaviour, we get a more realistic overview of the entire system.
Keywords :
data mining; failure analysis; large-scale systems; parallel processing; prediction theory; program diagnostics; signal processing; software fault tolerance; 10 Petaflop systems; ELSA; complex machines; event analysis; event flows mining; event log signal analyzer; event mining approaches; events generation; failure rates; fault tolerance algorithms; faulty behaviour; filtering algorithm; hardware components; large-scale HPC systems; normal system behaviour; real failure tracing; short-term fault prediction methodology; signal analysis; software components; system administration; system state data generation; Analytical models; Correlation; Data mining; Large-scale systems; Prediction algorithms; Predictive models; Signal analysis; fault detection; fault tolerance; large-scale HPC systems; signal analysis
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Parallel & Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International
Conference_Location :
Shanghai
ISSN :
1530-2075
Print_ISBN :
978-1-4673-0975-2
Type :
conf
DOI :
10.1109/IPDPS.2012.107
Filename :
6267920