• DocumentCode
    1507087
  • Title
    Assessment of the Impact of Cosmic-Ray-Induced Neutrons on Hardware in the Roadrunner Supercomputer
  • Author
    Michalak, Sarah E. ; DuBois, Andrew J. ; Storlie, Curtis B. ; Quinn, Heather M. ; Rust, William N. ; DuBois, David H. ; Modl, David G. ; Manuzzato, Andrea ; Blanchard, Sean P.
  • Author_Institution
    Stat. Sci. Group, Los Alamos Nat. Lab., Los Alamos, NM, USA
  • Volume
    12
  • Issue
    2
  • fYear
    2012
  • fDate
    6/1/2012
  • Firstpage
    445
  • Lastpage
    454
  • Abstract
    Microprocessor-based systems are a common design for high-performance computing (HPC) platforms. In these systems, several thousands of microprocessors can participate in a single calculation that may take weeks or months to complete. When used in this manner, a fault in any of the microprocessors could cause the computation to crash or cause silent data corruption (SDC), i.e., computationally incorrect results that originate from an undetected fault. In recent years, neutron-induced effects in HPC hardware have been observed, and researchers have started to study how neutrons impact microprocessor-based computations. This paper presents results from an accelerated neutron-beam test focusing on two microprocessors used in Roadrunner, which is the first petaflop supercomputer. Research questions of interest include whether the application running affects neutron susceptibility and whether different replicates of the hardware under test have different susceptibilities to neutrons. Estimated failures in time for crashes and for SDC are presented for the hardware under test, for the Triblade servers used for computation in Roadrunner, and for Roadrunner.
  • Keywords
    cosmic rays; mainframes; parallel machines; HPC; SDC; Triblade servers; hardware cosmic ray induced neutrons; high performance computing; microprocessor based computations; microprocessor based systems; neutron susceptibility; petaflop supercomputer; roadrunner supercomputer; silent data corruption; Blades; Computer architecture; Hardware; Microprocessors; Neutrons; Program processors; Testing; Failures in time (FIT); neutron-beam testing; silent data corruption (SDC); single-event effect; soft error
  • fLanguage
    English
  • Journal_Title
    Device and Materials Reliability, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1530-4388
  • Type
    jour
  • DOI
    10.1109/TDMR.2012.2192736
  • Filename
    6193419