• DocumentCode
    2859177
  • Title
    Establishing Hypothesis for Recurrent System Failures from Cluster Log Files
  • Author
    Chuah, Edward ; Lee, Gary ; Tjhi, William-Chandra ; Kuo, Shyh-hao ; Hung, Terence ; Hammond, John ; Minyard, Tommy ; Browne, James C.
  • Author_Institution
    Inst. of High Performance Comput., Singapore, Singapore
  • fYear
    2011
  • fDate
    12-14 Dec. 2011
  • Firstpage
    15
  • Lastpage
    22
  • Abstract
    A goal for the analysis of supercomputer logs is to establish causal relationships among events which reflect significant state changes in the system. Establishing these relationships is at the heart of failure diagnosis. In principle, a log analysis tool could automate many of the manual steps systems administrators must currently use to diagnose system failures. However, supercomputer logs are unstructured, incomplete and contain considerable ambiguity so that direct discovery of causal relationships is difficult. This paper describes the second-generation FDiag log-based failure diagnostics framework that provides automation of the manual failure diagnosis process and determines with high confidence the likely cause of the failure, the components involved and the event sequences which contain the times of the causal and terminal events. FDiag extracts relevant events from the system logs, performs correlation analysis on these events and from these correlations determines the components involved and the event sequences. The diagnostics capabilities of FDiag are validated by comparing its assessments on known instances of recurrent failures on the Ranger supercomputer at the University of Texas at Austin. We believe FDiag is the first log analyzer to demonstrate this level of diagnostics capability from the system logs of an open source software stack incorporating Linux and the Lustre file system. FDiag will be put into production use for support of failure diagnosis on Ranger in September 2011.
  • Keywords
    fault diagnosis; parallel machines; system monitoring; FDiag log-based failure diagnostics framework; Linux; Lustre file system; Ranger supercomputer; cluster log files; correlation analysis; failure diagnosis; log analysis tool; open source software; recurrent system failures; supercomputer logs; Correlation; Data mining; Laser mode locking; Manuals; Protocols; Servers; Supercomputers; Failure diagnosis; Hypothesis testing; Large cluster systems; Reliability; Syslogs;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Dependable, Autonomic and Secure Computing (DASC), 2011 IEEE Ninth International Conference on
  • Conference_Location
    Sydney, NSW
  • Print_ISBN
    978-1-4673-0006-3
  • Type
    conf
  • DOI
    10.1109/DASC.2011.27
  • Filename
    6118346