DocumentCode :
1244042
Title :
Predicting the number of fatal soft errors in Los Alamos National Laboratory's ASC Q supercomputer
Author :
Michalak, Sarah E. ; Harris, Kevin W. ; Hengartner, Nicolas W. ; Takala, Bruce E. ; Wender, Stephen A.
Author_Institution :
Statistical Sciences Group, Los Alamos National Laboratory, NM, USA
Volume :
5
Issue :
3
fYear :
2005
Firstpage :
329
Lastpage :
335
Abstract :
Early in the deployment of the Advanced Simulation and Computing (ASC) Q supercomputer, a higher-than-expected number of single-node failures was observed. The elevated rate of single-node failures was hypothesized to be caused primarily by fatal soft errors, i.e., board-level cache (B-cache) tag (BTAG) parity errors caused by cosmic-ray-induced neutrons that led to node crashes. A series of experiments was undertaken at the Los Alamos Neutron Science Center (LANSCE) to ascertain whether fatal soft errors were indeed the primary cause of the elevated rate of single-node failures. Observed failure data from Q are consistent with the results from some of these experiments. Mitigation strategies have been developed, and scientists successfully use Q for large computations in the presence of fatal soft errors and other single-node failures.
Keywords :
SRAM chips; cosmic ray neutrons; error correction codes; error detection codes; failure analysis; fault tolerant computing; integrated circuit testing; mainframes; neutron effects; parallel machines; parity check codes; semiconductor device testing; ASC Q supercomputer; Los Alamos National Laboratory; board level cache tag parity errors; cosmic ray induced neutrons; fatal soft errors; linear accelerators; memory testing; neutron beam; neutron radiation effects; node crashes; semiconductor device radiation effects; single event upset; single node failures; soft error rate; Computational modeling; Computer errors; Laboratories; Life testing; Neutrons; Random access memory; Runtime; Supercomputers; Cosmic-ray-induced neutron; life estimation; neutron-induced soft error; semiconductor-device radiation effects; semiconductor-device testing; single-event upset; soft-error rate; static random access memory (SRAM) chips
fLanguage :
English
Journal_Title :
Device and Materials Reliability, IEEE Transactions on
Publisher :
IEEE
ISSN :
1530-4388
Type :
jour
DOI :
10.1109/TDMR.2005.855685
Filename :
1545893