Title :
Assessment of the Impact of Cosmic-Ray-Induced Neutrons on Hardware in the Roadrunner Supercomputer
Author :
Michalak, Sarah E. ; DuBois, Andrew J. ; Storlie, Curtis B. ; Quinn, Heather M. ; Rust, William N. ; DuBois, David H. ; Modl, David G. ; Manuzzato, Andrea ; Blanchard, Sean P.
Author_Institution :
Stat. Sci. Group, Los Alamos Nat. Lab., Los Alamos, NM, USA
fDate :
6/1/2012
Abstract :
Microprocessor-based systems are a common design for high-performance computing (HPC) platforms. In these systems, several thousand microprocessors can participate in a single calculation that may take weeks or months to complete. When microprocessors are used in this manner, a fault in any one of them could cause the computation to crash or cause silent data corruption (SDC), i.e., computationally incorrect results that originate from an undetected fault. In recent years, neutron-induced effects in HPC hardware have been observed, and researchers have started to study how neutrons impact microprocessor-based computations. This paper presents results from an accelerated neutron-beam test focusing on two microprocessors used in Roadrunner, the first petaflop supercomputer. Research questions of interest include whether the application being run affects neutron susceptibility and whether different replicates of the hardware under test have different susceptibilities to neutrons. Estimated failures in time (FIT) for crashes and for SDC are presented for the hardware under test, for the Triblade servers used for computation in Roadrunner, and for Roadrunner as a whole.
Keywords :
cosmic rays; mainframes; parallel machines; HPC; SDC; Triblade servers; hardware cosmic ray induced neutrons; high performance computing; microprocessor based computations; microprocessor based systems; neutron susceptibility; petaflop supercomputer; roadrunner supercomputer; silent data corruption; Blades; Computer architecture; Hardware; Microprocessors; Neutrons; Program processors; Testing; Failures in time (FIT); neutron-beam testing; silent data corruption (SDC); single-event effect; soft error
Journal_Title :
Device and Materials Reliability, IEEE Transactions on
DOI :
10.1109/TDMR.2012.2192736