• DocumentCode
    710649
  • Title
    Field, experimental, and analytical data on large-scale HPC systems and evaluation of the implications for exascale system design
  • Author
    DeBardeleben, Nathan ; Blanchard, Sean ; Kaeli, David ; Rech, Paolo
  • Author_Institution
    Ultrascale Syst. Res. Center, Los Alamos Nat. Lab., Los Alamos, NM, USA
  • fYear
    2015
  • fDate
    27-29 April 2015
  • Firstpage
    1
  • Lastpage
    2
  • Abstract
    Reliability is an issue for the designers, producers, and users of today's large-scale computing systems. As we approach exascale, the resilience challenge will become critical due to the increase in system scale. It is therefore fundamental to understand the nature of errors, evaluate their probability of occurrence, and improve the design to reduce their impact on the overall system. In this paper, we present experimental, field, and analytical data to characterize and quantify errors on accelerators, providing a thorough understanding of the impact of errors on today's and future large-scale systems.
  • Keywords
    large-scale systems; parallel processing; software reliability; HPC system; analytical data; exascale system design; experimental data; field data; large-scale system; system reliability; Computer architecture; Graphics processing units; Hardware; Reliability engineering; Resilience
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    VLSI Test Symposium (VTS), 2015 IEEE 33rd
  • Conference_Location
    Napa, CA
  • Type
    conf
  • DOI
    10.1109/VTS.2015.7116295
  • Filename
    7116295