Title :
Soft error resiliency characterization and improvement on IBM BlueGene/Q processor using accelerated proton irradiation
Author :
Cher, Chen-Yong ; Muller, K. Paul ; Haring, Ruud A. ; Satterfield, David L. ; Musta, Thomas E. ; Gooding, Thomas M. ; Davis, Kristan D. ; Dombrowa, Marc B. ; Kopcsay, Gerard V. ; Senger, Robert M. ; Sugawara, Yutaka ; Sugavanam, Krishnan
Author_Institution :
IBM Res., Yorktown Heights, NY, USA
Abstract :
Fault injection through accelerated irradiation is an effective way to evaluate the overall soft error resiliency of microprocessors. In this work, we report on irradiation experiments on a Blue Gene/Q (BG/Q) compute processor chip running selected applications. Blue Gene/Q is the third generation of IBM's massively parallel, energy-efficient Blue Gene series of supercomputers. In the experiments, we observed 69 code fails. Of these, 26 are relevant for calculating the mean-time-between-failures (MTBF) of a 20-PetaFLOP, 96-rack system running a comparable workload mix. The expected MTBF for check-stops due to cosmic radiation and alpha particles from chip packaging materials is calculated to be 51 days at sea level in New York City when running the application mix studied. If the most vulnerable application is run exclusively, the projected MTBF is 35 days. These are outstanding results for a machine of this magnitude. The beaming experiment and projected MTBF validate the necessity of including autonomous hardware detection and recovery, at the cost of design effort, silicon area, and power.
Keywords :
chip scale packaging; mainframes; microprocessor chips; parallel machines; radiation hardening (electronics); 20 PetaFLOP; 96 rack system; BG/Q compute processor chip; IBM BlueGene; IBM Q processor; MTBF; New York City; accelerated irradiation; accelerated proton irradiation; autonomous hardware detection; autonomous hardware recovery; chip packaging materials; fault injection; mean-time-between-failures; microprocessors soft error resiliency characterization; supercomputers; Circuit faults; Hardware; Neutrons; Packaging; Particle beams; Radiation effects; System-on-chip;
Conference_Title :
Test Conference (ITC), 2014 IEEE International
Conference_Location :
Seattle, WA
DOI :
10.1109/TEST.2014.7035317