Title :
Establishing Hypothesis for Recurrent System Failures from Cluster Log Files
Author :
Chuah, Edward ; Lee, Gary ; Tjhi, William-Chandra ; Kuo, Shyh-hao ; Hung, Terence ; Hammond, John ; Minyard, Tommy ; Browne, James C.
Author_Institution :
Inst. of High Performance Comput., Singapore, Singapore
Abstract :
A goal for the analysis of supercomputer logs is to establish causal relationships among events that reflect significant state changes in the system. Establishing these relationships is at the heart of failure diagnosis. In principle, a log analysis tool could automate many of the manual steps that system administrators must currently perform to diagnose system failures. However, supercomputer logs are unstructured, incomplete, and considerably ambiguous, so direct discovery of causal relationships is difficult. This paper describes the second-generation FDiag log-based failure diagnostics framework, which automates the manual failure diagnosis process and determines, with high confidence, the likely cause of the failure, the components involved, and the event sequences that contain the times of the causal and terminal events. FDiag extracts relevant events from the system logs, performs correlation analysis on these events, and from these correlations determines the components involved and the event sequences. The diagnostic capabilities of FDiag are validated by comparing its assessments on known instances of recurrent failures on the Ranger supercomputer at the University of Texas at Austin. We believe FDiag is the first log analyzer to demonstrate this level of diagnostic capability from the system logs of an open source software stack incorporating Linux and the Lustre file system. FDiag will be put into production use for support of failure diagnosis on Ranger in September 2011.
Keywords :
fault diagnosis; parallel machines; system monitoring; FDiag log-based failure diagnostics framework; Linux; Lustre file system; Ranger supercomputer; cluster log files; correlation analysis; failure diagnosis; log analysis tool; open source software; recurrent system failures; supercomputer logs; Correlation; Data mining; Laser mode locking; Manuals; Protocols; Servers; Supercomputers; Failure diagnosis; Hypothesis testing; Large cluster systems; Reliability; Syslogs;
Conference_Title :
Dependable, Autonomic and Secure Computing (DASC), 2011 IEEE Ninth International Conference on
Conference_Location :
Sydney, NSW
Print_ISBN :
978-1-4673-0006-3
DOI :
10.1109/DASC.2011.27