DocumentCode
1967196
Title
Module Prototype for Online Failure Prediction for the IBM Blue Gene/L
Author
Solano-Quinde, Lizandro D. ; Bode, Brett M.
Author_Institution
Ames Lab, Scalable Comput. Lab., Iowa State Univ., Ames, IA
fYear
2008
fDate
18-20 May 2008
Firstpage
470
Lastpage
474
Abstract
The growing complexity of scientific applications has led to the design and deployment of large-scale parallel systems. The IBM Blue Gene/L can hold more than 200K processors and was designed for high performance and reliability. However, failures in this large-scale parallel system are a major concern, since a failure has been shown to significantly reduce system performance.
Keywords
fault tolerant computing; parallel machines; system recovery; IBM Blue Gene/L; fault tolerance; large-scale parallel systems; online failure prediction; Checkpointing; Degradation; Fault tolerance; Fault tolerant systems; Information analysis; Large-scale systems; Pattern matching; Prototypes; Software prototyping; System performance; Blue Gene/L; Computer Fault Tolerance; Failure Analysis; Software Fault Tolerance
fLanguage
English
Publisher
IEEE
Conference_Titel
Electro/Information Technology, 2008. EIT 2008. IEEE International Conference on
Conference_Location
Ames, IA
Print_ISBN
978-1-4244-2029-2
Electronic_ISBN
978-1-4244-2030-8
Type
conf
DOI
10.1109/EIT.2008.4554349
Filename
4554349