• DocumentCode
    244347
  • Title

    Warped-Shield: Tolerating Hard Faults in GPGPUs

  • Author

    Dweik, Waleed ; Abdel-Majeed, M. ; Annavaram, Murali

  • Author_Institution
    Ming Hsieh Dept. of Electr. Eng., Univ. of Southern California, Los Angeles, CA, USA
  • fYear
    2014
  • fDate
    23-26 June 2014
  • Firstpage
    431
  • Lastpage
    442
  • Abstract
    Graphics processing units (GPUs) are rapidly becoming the parallel accelerators of choice to run general purpose applications. GPUs that run general purpose applications are termed GPGPUs. Many mission-critical and long-running scientific applications are being ported to run on GPGPUs. These applications demand strong computational integrity. GPGPUs, like many other digital components, face imminent reliability threats due to technology scaling. Of particular concern are infield hard faults, which are persistent and irreversible. GPGPUs comprise dozens of streaming processors, where each streaming processor employs tens of execution units organized as single instruction multiple thread (SIMT) lanes to deliver massively parallel computational power. In this paper we exploit the massive replication of SIMT lanes to tolerate infield hard faults. First, we introduce thread shuffling to reroute threads originally mapped to faulty SIMT lanes to idle healthy lanes. Thread shuffling is insufficient when the number of healthy SIMT lanes is smaller than the number of active threads. To broaden the reach of thread shuffling, we propose dynamic warp deformation, which splits the warp into multiple sub-warps; each sub-warp uses fewer SIMT lanes, thereby providing more opportunities to avoid a faulty SIMT lane. Finally, we propose warp shuffling, which exploits non-uniform degradation of different streaming processors by scheduling a warp to a streaming processor that requires fewer warp splits. Hence, warp shuffling helps to reduce the performance overhead associated with dynamic warp deformation. By deploying the proposed techniques, we can tolerate the worst-case scenario of up to three hard faults per cluster of four SIMT lanes with at most 36% performance degradation.
  • Keywords
    fault tolerant computing; graphics processing units; multi-threading; parallel processing; scheduling; GPGPUs; SIMT lanes; computational integrity; dynamic warp deformation; general purpose applications; infield hard fault tolerance; long-running scientific application; mission-critical scientific application; parallel accelerators; parallel computational power; performance overhead reduction; single instruction multiple thread lanes; streaming processors; thread rerouting; thread shuffling; warp scheduling; warp shuffling; warped-shield; Benchmark testing; Fault tolerance; Fault tolerant systems; Instruction sets; Optimized production technology; Registers; Single instruction multiple threads (SIMT); warp deformation
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Dependable Systems and Networks (DSN), 2014 44th Annual IEEE/IFIP International Conference on
  • Conference_Location
    Atlanta, GA
  • Type

    conf

  • DOI
    10.1109/DSN.2014.95
  • Filename
    6903600
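
The thread shuffling and dynamic warp deformation steps described in the abstract can be illustrated with a minimal sketch for a single four-lane cluster. This is not the paper's hardware design: the function name `shuffle_cluster`, the boolean-list encoding of active threads and healthy lanes, and the greedy assignment policy are all assumptions made for illustration. Threads stay on their own lane when it is healthy, threads on faulty lanes are rerouted to idle healthy lanes, and the warp is split into sub-warps only when active threads outnumber healthy lanes.

```python
def shuffle_cluster(active, healthy):
    """Hypothetical sketch: map the active thread slots of one cluster to healthy lanes.

    active[i]  is True if thread slot i has work this cycle.
    healthy[i] is True if SIMT lane i is fault-free.
    Returns a list of sub-warps; each sub-warp is a dict {thread_slot: lane}.
    More than one sub-warp means the warp had to be deformed (split).
    """
    threads = [i for i, is_active in enumerate(active) if is_active]
    lanes = [i for i, is_healthy in enumerate(healthy) if is_healthy]
    if not threads:
        return []
    if not lanes:
        raise ValueError("no healthy lane left in this cluster")

    sub_warps = []
    # Each pass can issue at most len(lanes) threads (assumed splitting policy).
    for start in range(0, len(threads), len(lanes)):
        batch = threads[start:start + len(lanes)]
        free = list(lanes)
        mapping = {}
        # Threads whose original lane is healthy keep that lane.
        for t in batch:
            if healthy[t]:
                mapping[t] = t
                free.remove(t)
        # Thread shuffling: reroute the remaining threads to idle healthy lanes.
        for t in batch:
            if t not in mapping:
                mapping[t] = free.pop(0)
        sub_warps.append(mapping)
    return sub_warps


# Two active threads, lane 1 faulty: shuffling alone suffices, no split.
print(shuffle_cluster([True, True, False, False], [True, False, True, True]))
# Worst case from the abstract: three faulty lanes out of four, all threads active.
print(len(shuffle_cluster([True] * 4, [True, False, False, False])))
```

In the first call the thread on faulty lane 1 moves to idle lane 2 and the warp issues in one pass. In the second call only one lane survives, so the four threads are serialized into four sub-warps, which is the kind of overhead that warp shuffling tries to reduce by preferring a less degraded streaming processor.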