• DocumentCode
    1967196
  • Title
    Module Prototype for Online Failure Prediction for the IBM Blue Gene/L
  • Author
    Solano-Quinde, Lizandro D.; Bode, Brett M.
  • Author_Institution
    Ames Lab, Scalable Comput. Lab., Iowa State Univ., Ames, IA
  • fYear
    2008
  • fDate
    18-20 May 2008
  • Firstpage
    470
  • Lastpage
    474
  • Abstract
    The growing complexity of scientific applications has led to the design and deployment of large-scale parallel systems. The IBM Blue Gene/L, which can hold more than 200K processors, was designed for high performance and reliability. However, failures in such a large-scale parallel system remain a major concern, since a failure has been shown to significantly reduce system performance.
  • Keywords
    fault tolerant computing; parallel machines; system recovery; IBM Blue Gene/L; fault tolerance; large-scale parallel systems; online failure prediction; Checkpointing; Degradation; Fault tolerance; Fault tolerant systems; Information analysis; Large-scale systems; Pattern matching; Prototypes; Software prototyping; System performance; Blue Gene/L; Computer Fault Tolerance; Failure Analysis; Software Fault Tolerance
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Electro/Information Technology, 2008. EIT 2008. IEEE International Conference on
  • Conference_Location
    Ames, IA
  • Print_ISBN
    978-1-4244-2029-2
  • Electronic_ISBN
    978-1-4244-2030-8
  • Type
    conf
  • DOI
    10.1109/EIT.2008.4554349
  • Filename
    4554349