• DocumentCode
    1882004
  • Title
    Warped-DMR: Light-weight Error Detection for GPGPU
  • Author
    Jeon, Hyeran ; Annavaram, Murali
  • Author_Institution
    Univ. of Southern California, Los Angeles, CA, USA
  • fYear
    2012
  • fDate
    1-5 Dec. 2012
  • Firstpage
    37
  • Lastpage
    47
  • Abstract
    General purpose graphics processing units (GPGPUs) are feature-rich GPUs that provide general purpose computing ability with a massive number of parallel threads. This massive parallelism, combined with programmability, has made GPGPUs the most attractive choice in supercomputing centers. Unsurprisingly, most GPGPU-based studies have focused on improving performance by leveraging the GPGPU's high degree of parallelism. However, for many scientific applications that commonly run on supercomputers, program correctness is as important as performance. Even a few soft or hard errors can corrupt results and potentially waste days or even months of computing effort. In this research we exploit unique architectural characteristics of GPGPUs to propose a light-weight error detection method called Warped Dual Modular Redundancy (Warped-DMR). Warped-DMR detects errors in computation by relying on opportunistic spatial and temporal dual-modular execution of code. Warped-DMR is light-weight because it exploits the underutilized parallelism in GPGPU computing for error detection. Error detection operates both within a warp and between warps, called intra-warp and inter-warp DMR, respectively. Warped-DMR achieves 96% error coverage while incurring a worst-case 16% performance overhead, without extra execution units or programmer effort.
  • Keywords
    error detection; graphics processing units; parallel machines; GPGPU; general purpose graphics processing units; lightweight error detection; parallel threads; program correctness; supercomputing centers; warped dual modular redundancy; warped-DMR; DMR; Reliable computing
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Microarchitecture (MICRO), 2012 45th Annual IEEE/ACM International Symposium on
  • Conference_Location
    Vancouver, BC
  • ISSN
    1072-4451
  • Print_ISBN
    978-1-4673-4819-5
  • Type
    conf
  • DOI
    10.1109/MICRO.2012.13
  • Filename
    6493606