• DocumentCode
    1640667
  • Title

    Bad Words: Finding Faults in Spirit's Syslogs

  • Author

    Stearley, Jon ; Oliner, Adam J.

  • Author_Institution
    Sandia Nat. Labs., Albuquerque, NM
  • fYear
    2008
  • Firstpage
    765
  • Lastpage
    770
  • Abstract
    Accurate fault detection is a key element of resilient computing. Syslogs provide key information regarding faults, and are found on nearly all computing systems. Discovering new fault types requires expert human effort, however, as no previous algorithm has been shown to localize faults in time and space with an operationally acceptable false positive rate. We present experiments on three weeks of syslogs from Sandia's 512-node "Spirit" Linux cluster, showing one algorithm that localizes 50% of faults with 75% precision, corresponding to an excellent false positive rate of 0.05%. The salient characteristics of this algorithm are (1) calculation of nodewise information entropy, and (2) encoding of word position. The key observation is that similar computers correctly executing similar work should produce similar logs.
  • Keywords
    Linux; entropy; software fault tolerance; system monitoring; Spirit Linux cluster; bad words; false positive rate; fault detection; information entropy; syslogs; Clustering algorithms; Computer science; Fault detection; Grid computing; Humans; Laboratories; Monitoring; Programming profession; Supercomputers; USA Councils
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cluster Computing and the Grid, 2008. CCGRID '08. 8th IEEE International Symposium on
  • Conference_Location
    Lyon
  • Print_ISBN
    978-0-7695-3156-4
  • Electronic_ISBN
    978-0-7695-3156-4
  • Type

    conf

  • DOI
    10.1109/CCGRID.2008.107
  • Filename
    4534301
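
The abstract names two characteristics of the algorithm: nodewise information entropy and encoding of word position. The sketch below is one possible reading of that idea and is not taken from the paper. It assumes a standard log-entropy term weighting from information retrieval, and the names position_terms and node_scores, the "index:word" term format, and the scoring formula are all illustrative assumptions. Terms that appear evenly across nodes get a weight near 0, terms concentrated on one node get a weight near 1, so a node whose log differs from its peers scores higher.

```python
import math
from collections import Counter


def position_terms(message):
    """Encode word position (assumed format): 'link down' -> ['0:link', '1:down']."""
    return [f"{i}:{w}" for i, w in enumerate(message.split())]


def node_scores(logs):
    """Score each node by how unusual its terms are relative to the other nodes.

    logs: dict mapping node name -> list of syslog message strings.
    Returns a dict mapping node name -> non-negative score.
    """
    nodes = list(logs)
    # Per-node counts of position-encoded terms.
    counts = {
        n: Counter(t for msg in logs[n] for t in position_terms(msg))
        for n in nodes
    }
    # Total count of each term over all nodes.
    totals = Counter()
    for c in counts.values():
        totals.update(c)

    # Normalizer so that a perfectly even spread over all nodes gives weight 0.
    log_n = math.log2(len(nodes)) if len(nodes) > 1 else 1.0

    weights = {}
    for term, total in totals.items():
        entropy = 0.0
        for n in nodes:
            x = counts[n][term]
            if x:
                p = x / total
                entropy -= p * math.log2(p)
        weights[term] = 1.0 - entropy / log_n

    # Assumed scoring rule: Euclidean norm of weighted, log-damped counts.
    return {
        n: math.sqrt(
            sum((weights[t] * math.log2(1 + x)) ** 2 for t, x in counts[n].items())
        )
        for n in nodes
    }
```

With four hypothetical nodes that all log the same message, plus one node that also logs an extra message seen nowhere else, the shared terms contribute nothing and only the odd node receives a positive score. This mirrors the abstract's observation that similar computers doing similar work should produce similar logs.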