DocumentCode :
1507087
Title :
Assessment of the Impact of Cosmic-Ray-Induced Neutrons on Hardware in the Roadrunner Supercomputer
Author :
Michalak, Sarah E. ; DuBois, Andrew J. ; Storlie, Curtis B. ; Quinn, Heather M. ; Rust, William N. ; DuBois, David H. ; Modl, David G. ; Manuzzato, Andrea ; Blanchard, Sean P.
Author_Institution :
Stat. Sci. Group, Los Alamos Nat. Lab., Los Alamos, NM, USA
Volume :
12
Issue :
2
fYear :
2012
fDate :
6/1/2012
Firstpage :
445
Lastpage :
454
Abstract :
Microprocessor-based systems are a common design for high-performance computing (HPC) platforms. In these systems, several thousands of microprocessors can participate in a single calculation that may take weeks or months to complete. When used in this manner, a fault in any of the microprocessors could cause the computation to crash or cause silent data corruption (SDC), i.e., computationally incorrect results that originate from an undetected fault. In recent years, neutron-induced effects in HPC hardware have been observed, and researchers have started to study how neutrons impact microprocessor-based computations. This paper presents results from an accelerated neutron-beam test focusing on two microprocessors used in Roadrunner, which is the first petaflop supercomputer. Research questions of interest include whether the application running affects neutron susceptibility and whether different replicates of the hardware under test have different susceptibilities to neutrons. Estimated failures in time for crashes and for SDC are presented for the hardware under test, for the Triblade servers used for computation in Roadrunner, and for Roadrunner.
Keywords :
cosmic rays; mainframes; parallel machines; HPC; SDC; Triblade servers; hardware cosmic ray induced neutrons; high performance computing; microprocessor based computations; microprocessor based systems; neutron susceptibility; petaflop supercomputer; roadrunner supercomputer; silent data corruption; Blades; Computer architecture; Hardware; Microprocessors; Neutrons; Program processors; Testing; Failures in time (FIT); neutron-beam testing; silent data corruption (SDC); single-event effect; soft error;
fLanguage :
English
Journal_Title :
Device and Materials Reliability, IEEE Transactions on
Publisher :
IEEE
ISSN :
1530-4388
Type :
jour
DOI :
10.1109/TDMR.2012.2192736
Filename :
6193419