DocumentCode
2859177
Title
Establishing Hypothesis for Recurrent System Failures from Cluster Log Files
Author
Chuah, Edward ; Lee, Gary ; Tjhi, William-Chandra ; Kuo, Shyh-hao ; Hung, Terence ; Hammond, John ; Minyard, Tommy ; Browne, James C.
Author_Institution
Inst. of High Performance Comput., Singapore, Singapore
fYear
2011
fDate
12-14 Dec. 2011
Firstpage
15
Lastpage
22
Abstract
A goal of supercomputer log analysis is to establish causal relationships among events that reflect significant state changes in the system. Establishing these relationships is at the heart of failure diagnosis. In principle, a log analysis tool could automate many of the manual steps that system administrators must currently perform to diagnose system failures. However, supercomputer logs are unstructured, incomplete, and considerably ambiguous, so direct discovery of causal relationships is difficult. This paper describes the second-generation FDiag log-based failure diagnostics framework, which automates the manual failure diagnosis process and determines, with high confidence, the likely cause of a failure, the components involved, and the event sequences that contain the times of the causal and terminal events. FDiag extracts relevant events from the system logs, performs correlation analysis on these events, and from these correlations determines the components involved and the event sequences. The diagnostic capabilities of FDiag are validated by comparing its assessments against known instances of recurrent failures on the Ranger supercomputer at the University of Texas at Austin. We believe FDiag is the first log analyzer to demonstrate this level of diagnostic capability from the system logs of an open source software stack incorporating Linux and the Lustre file system. FDiag will be put into production use to support failure diagnosis on Ranger in September 2011.
Keywords
fault diagnosis; parallel machines; system monitoring; FDiag log-based failure diagnostics framework; Linux; Lustre file system; Ranger supercomputer; cluster log files; correlation analysis; failure diagnosis; log analysis tool; open source software; recurrent system failures; supercomputer logs; Correlation; Data mining; Laser mode locking; Manuals; Protocols; Servers; Supercomputers; Failure diagnosis; Hypothesis testing; Large cluster systems; Reliability; Syslogs;
fLanguage
English
Publisher
IEEE
Conference_Titel
Dependable, Autonomic and Secure Computing (DASC), 2011 IEEE Ninth International Conference on
Conference_Location
Sydney, NSW
Print_ISBN
978-1-4673-0006-3
Type
conf
DOI
10.1109/DASC.2011.27
Filename
6118346