DocumentCode
1507087
Title
Assessment of the Impact of Cosmic-Ray-Induced Neutrons on Hardware in the Roadrunner Supercomputer
Author
Michalak, Sarah E. ; DuBois, Andrew J. ; Storlie, Curtis B. ; Quinn, Heather M. ; Rust, William N. ; DuBois, David H. ; Modl, David G. ; Manuzzato, Andrea ; Blanchard, Sean P.
Author_Institution
Stat. Sci. Group, Los Alamos Nat. Lab., Los Alamos, NM, USA
Volume
12
Issue
2
fYear
2012
fDate
6/1/2012
Firstpage
445
Lastpage
454
Abstract
Microprocessor-based systems are a common design for high-performance computing (HPC) platforms. In these systems, several thousands of microprocessors can participate in a single calculation that may take weeks or months to complete. When used in this manner, a fault in any of the microprocessors could cause the computation to crash or cause silent data corruption (SDC), i.e., computationally incorrect results that originate from an undetected fault. In recent years, neutron-induced effects in HPC hardware have been observed, and researchers have started to study how neutrons impact microprocessor-based computations. This paper presents results from an accelerated neutron-beam test focusing on two microprocessors used in Roadrunner, which is the first petaflop supercomputer. Research questions of interest include whether the application running affects neutron susceptibility and whether different replicates of the hardware under test have different susceptibilities to neutrons. Estimated failures in time for crashes and for SDC are presented for the hardware under test, for the Triblade servers used for computation in Roadrunner, and for Roadrunner.
Keywords
cosmic rays; mainframes; parallel machines; HPC; SDC; Triblade servers; hardware cosmic ray induced neutrons; high performance computing; microprocessor based computations; microprocessor based systems; neutron susceptibility; petaflop supercomputer; roadrunner supercomputer; silent data corruption; Blades; Computer architecture; Hardware; Microprocessors; Neutrons; Program processors; Testing; Failures in time (FIT); neutron-beam testing; silent data corruption (SDC); single-event effect; soft error
fLanguage
English
Journal_Title
Device and Materials Reliability, IEEE Transactions on
Publisher
IEEE
ISSN
1530-4388
Type
jour
DOI
10.1109/TDMR.2012.2192736
Filename
6193419