• DocumentCode
    1971744
  • Title

    BlueGene/L Failure Analysis and Prediction Models

  • Author

    Liang, Yinglung ; Zhang, Yanyong ; Jette, Morris ; Sivasubramaniam, Anand ; Sahoo, Ramendra

  • Author_Institution
    Dept. of Electr. & Comput. Eng., Rutgers Univ., Piscataway, NJ
  • fYear
    2006
  • fDate
    25-28 June 2006
  • Firstpage
    425
  • Lastpage
    434
  • Abstract
    The growing computational and storage needs of several scientific applications mandate the deployment of extreme-scale parallel machines, such as IBM's BlueGene/L, which can accommodate as many as 128K processors. One of the challenges in designing and deploying these systems in a production setting is the need to account for failures, whether in hardware or in software. Earlier work has shown that conventional runtime fault-tolerance techniques such as periodic checkpointing are not effective for these emerging systems. Instead, the ability to predict failure occurrences can help develop more effective checkpointing strategies. Failure prediction has long been regarded as a challenging research problem, mainly due to the lack of realistic failure data from actual production systems. In this study, we have collected RAS event logs from BlueGene/L over a period of more than 100 days. We have investigated the characteristics of fatal failure events, as well as the correlation between fatal events and non-fatal events. Based on the observations, we have developed three simple yet effective failure prediction methods, which can predict around 80% of the memory and network failures, and 47% of the application I/O failures.
  • Keywords
    checkpointing; parallel machines; BlueGene/L failure analysis; RAS event logs; checkpointing strategies; failure prediction models; Concurrent computing; Failure analysis; Fault tolerant systems; Hardware; Prediction methods; Predictive models; Production systems; Runtime
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Dependable Systems and Networks, 2006. DSN 2006. International Conference on
  • Conference_Location
    Philadelphia, PA
  • Print_ISBN
    0-7695-2607-1
  • Type

    conf

  • DOI
    10.1109/DSN.2006.18
  • Filename
    1633531