DocumentCode :
710649
Title :
Field, experimental, and analytical data on large-scale HPC systems and evaluation of the implications for exascale system design
Author :
DeBardeleben, Nathan ; Blanchard, Sean ; Kaeli, David ; Rech, Paolo
Author_Institution :
Ultrascale Syst. Res. Center, Los Alamos Nat. Lab., Los Alamos, NM, USA
fYear :
2015
fDate :
27-29 April 2015
Firstpage :
1
Lastpage :
2
Abstract :
Reliability is an issue for designers, producers, and users of today's large-scale computing systems. As we approach exascale, the resilience challenge will become critical due to the increase in system scale. It is therefore fundamental to understand the nature of errors, evaluate their probability of occurrence, and improve the design to reduce their impact on the overall system. In this paper we present experimental, field, and analytical data to characterize and quantify errors on accelerators, providing a thorough understanding of the impact of errors on today's and future large-scale systems.
Keywords :
large-scale systems; parallel processing; software reliability; HPC system; analytical data; exascale system design; experimental data; field data; large-scale system; system reliability; Computer architecture; Graphics processing units; Hardware; Reliability engineering; Resilience;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
VLSI Test Symposium (VTS), 2015 IEEE 33rd
Conference_Location :
Napa, CA
Type :
conf
DOI :
10.1109/VTS.2015.7116295
Filename :
7116295