• DocumentCode
    3205272
  • Title

    Hauberk: Lightweight Silent Data Corruption Error Detector for GPGPU

  • Author

    Yim, Keun Soo ; Pham, Cuong ; Saleheen, Mushfiq ; Kalbarczyk, Zbigniew ; Iyer, Ravishankar

  • Author_Institution
    Center for Reliable & High Performance Comput., Univ. of Illinois at Urbana-Champaign, Urbana, IL, USA
  • fYear
    2011
  • fDate
    16-20 May 2011
  • Firstpage
    287
  • Lastpage
    300
  • Abstract
    The high performance and relatively low cost of GPU-based platforms make them an attractive alternative for general-purpose high performance computing (HPC). However, emerging HPC applications usually have stricter output correctness requirements than typical GPU applications (i.e., 3D graphics). This paper first analyzes the error resiliency of GPGPU platforms using a fault injection tool we have developed for commodity GPU devices. On average, 16-33% of injected faults cause silent data corruption (SDC) errors in the HPC programs executing on GPUs. This SDC ratio is significantly higher than that measured in CPU programs (<2.3%). In order to tolerate SDC errors, customized error detectors are strategically placed in the source code of target GPU programs so as to minimize performance impact and error propagation and maximize recoverability. The presented Hauberk technique is deployed in seven HPC benchmark programs and evaluated using fault injection. The results show a high average error detection coverage (~87%) with a small performance overhead (~15%).
  • Keywords
    computer graphic equipment; coprocessors; error detection; GPGPU platform; GPU application; GPU program; GPU-based platform; HPC benchmark programs; Hauberk technique; commodity GPU device; error detection coverage; error propagation; error resiliency; fault injection tool; high performance computing; recoverability; silent data corruption error detector; source code; Detectors; Fault tolerance; Fault tolerant systems; Graphics processing unit; Hardware; Kernel; Three dimensional displays;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International
  • Conference_Location
    Anchorage, AK
  • ISSN
    1530-2075
  • Print_ISBN
    978-1-61284-372-8
  • Electronic_ISBN
    1530-2075
  • Type

    conf

  • DOI
    10.1109/IPDPS.2011.36
  • Filename
    6012845