DocumentCode
710649
Title
Field, experimental, and analytical data on large-scale HPC systems and evaluation of the implications for exascale system design
Author
DeBardeleben, Nathan ; Blanchard, Sean ; Kaeli, David ; Rech, Paolo
Author_Institution
Ultrascale Syst. Res. Center, Los Alamos Nat. Lab., Los Alamos, NM, USA
fYear
2015
fDate
27-29 April 2015
Firstpage
1
Lastpage
2
Abstract
Reliability is an issue for the designers, producers, and users of today's large-scale computing systems. As we approach exascale, the resilience challenge will become critical due to the increase in system scale. It is therefore fundamental to understand the nature of errors, evaluate their probability of occurrence, and improve the design to reduce their impact on the overall system. In this paper we present experimental, field, and analytical data to characterize and quantify errors on accelerators, providing a thorough understanding of the impact of errors on today's and future large-scale systems.
Keywords
large-scale systems; parallel processing; software reliability; HPC system; analytical data; exascale system design; experimental data; field data; large-scale system; system reliability; Computer architecture; Graphics processing units; Hardware; Reliability engineering; Resilience;
fLanguage
English
Publisher
IEEE
Conference_Titel
VLSI Test Symposium (VTS), 2015 IEEE 33rd
Conference_Location
Napa, CA
Type
conf
DOI
10.1109/VTS.2015.7116295
Filename
7116295