• DocumentCode
    640444
  • Title
    GPUburn: A system to test and mitigate GPU hardware failures
  • Author
    Defour, David ; Petit, Eric
  • Author_Institution
    Lab. DALI, Univ. de Perpignan Via Domitia, Perpignan, France
  • fYear
    2013
  • fDate
    15-18 July 2013
  • Firstpage
    263
  • Lastpage
    270
  • Abstract
    Due to factors such as high transistor density, high frequency, and low voltage, today's processors are more than ever subject to hardware failures. These errors have various impacts depending on the location of the error and the type of processor. Because of the hierarchical structure of the compute units and work scheduling, a hardware failure on a GPU affects only part of the application. In this paper we present a new methodology to characterize the hardware failures of Nvidia GPUs, based on a software micro-benchmarking platform implemented in OpenCL. We also identify which hardware part of the TESLA architecture is more sensitive to intermittent errors, which usually appear as the processor ages. We obtained these results by accelerating the aging process, running the processors at high temperature. We show that on GPUs the impact of intermittent errors is limited to a localized architecture tile. Finally, we propose a methodology to detect defective units and record their location so that they can be avoided, ensuring program correctness on such architectures and improving GPU fault-tolerance capability and lifespan.
  • Keywords
    fault tolerant computing; graphics processing units; program testing; GPU hardware failures; GPUburn; Nvidia GPU; OpenCL; TESLA architecture; defective units; hierarchical structure; program correctness; software micro-benchmarking platform; Aging; Computer architecture; Graphics processing units; Hardware; Instruction sets; Kernel;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIII), 2013 International Conference on
  • Conference_Location
    Agios Konstantinos
  • Type
    conf
  • DOI
    10.1109/SAMOS.2013.6621133
  • Filename
    6621133