Title :
Diagnosing the root-causes of failures from cluster log files
Author :
Chuah, Edward ; Kuo, Shyh-hao ; Hiew, Paul ; Tjhi, William-Chandra ; Lee, Gary ; Hammond, John ; Michalewicz, Marek T. ; Hung, Terence ; Browne, James C.
Author_Institution :
Inst. of High Performance Comput., Singapore, Singapore
Abstract :
System event logs are often the primary source of information for diagnosing (and predicting) the causes of failures in cluster systems. Because of interactions among the system hardware and software components, the system event logs of large cluster systems consist of streams of interleaved events, and only a small fraction of the events over a small time span are relevant to the diagnosis of a given failure. Furthermore, the process of troubleshooting the causes of failures is largely manual and ad hoc. In this paper, we present a systematic methodology for reconstructing event order and establishing correlations among events that indicate the root causes of a given failure from very large syslogs. We developed a diagnostics tool, FDiag, which extracts the log entries as structured message templates and uses statistical correlation analysis to establish probable cause-and-effect relationships for the fault being analyzed. We applied FDiag to analyze failures due to breakdowns in interactions between the Lustre file system and its clients on the Ranger supercomputer at the Texas Advanced Computing Center (TACC). The results are positive: FDiag is able to identify the dates and the time periods that contain the significant events which eventually led to the occurrence of compute node soft lockups.
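The two steps named in the abstract (extracting log entries as message templates, then correlating event counts with a failure event) can be illustrated with a minimal Python sketch. This is not FDiag's implementation, which the abstract does not specify: the masking rules, the fixed time window, the choice of the Pearson coefficient, and all function names below are assumptions made for illustration only.

```python
import re
from collections import Counter, defaultdict
from math import sqrt

# Hypothetical masking rules: hex values and digit runs become placeholders.
_HEX = re.compile(r"0x[0-9a-fA-F]+")
_NUM = re.compile(r"\d+")

def to_template(message):
    """Collapse variable fields so similar messages share one template."""
    return _NUM.sub("<NUM>", _HEX.sub("<HEX>", message))

def window_counts(events, window_seconds):
    """Count each template per fixed-size time window.

    events: iterable of (timestamp_seconds, message) pairs.
    Returns {template: Counter({window_index: count})}.
    """
    counts = defaultdict(Counter)
    for ts, msg in events:
        counts[to_template(msg)][int(ts // window_seconds)] += 1
    return counts

def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def correlate_with_failure(events, failure_template, window_seconds=60):
    """Correlate every other template's per-window counts with the failure's."""
    counts = window_counts(events, window_seconds)
    windows = sorted({w for c in counts.values() for w in c})
    target = [counts[failure_template][w] for w in windows]
    return {
        t: pearson([c[w] for w in windows], target)
        for t, c in counts.items()
        if t != failure_template
    }
```

Calling `correlate_with_failure(events, "soft lockup CPU#<NUM>")` on a list of `(timestamp, message)` pairs returns one coefficient per remaining template. A high coefficient only flags a candidate for closer inspection; it does not by itself establish cause and effect.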
Keywords :
correlation methods; file organisation; mainframes; parallel machines; pattern clustering; statistical analysis; system recovery; FDiag; Texas advanced computing center; cluster log file; cluster system failure; lustre file system; ranger supercomputer; root cause; software component; statistical correlation analysis; system event log; system hardware; Correlation; Correlators; Data mining; Heating; Kernel; Manuals; Supercomputers; Reliability; Resilient cluster systems; Statistical correlation analysis; Syslog files;
Conference_Title :
High Performance Computing (HiPC), 2010 International Conference on
Conference_Location :
Dona Paula
Print_ISBN :
978-1-4244-8518-5
Electronic_ISBN :
978-1-4244-8519-2
DOI :
10.1109/HIPC.2010.5713159