• DocumentCode
    159484
  • Title
    GPGPUs ECC efficiency and efficacy
  • Author
    Oliveira, Daniel A. G. ; Rech, P. ; Pilla, Laercio L. ; Navaux, Philippe Olivier Alexandre ; Carro, Luigi
  • Author_Institution
    Inst. of Inf., Fed. Univ. of Rio Grande do Sul, Porto Alegre, Brazil
  • fYear
    2014
  • fDate
    1-3 Oct. 2014
  • Firstpage
    209
  • Lastpage
    215
  • Abstract
    In this paper we assess and discuss the efficiency and overhead of the Error-Correcting Code (ECC) mechanism available on modern GPGPUs, which are increasingly used for both High Performance Computing and safety-critical applications. Resilience to both radiation-induced silent data corruption and functional interruption is addressed experimentally and analytically. The experimental analysis demonstrates that ECC significantly reduces the occurrence of silent data corruption but may not be sufficient to guarantee high reliability. Moreover, ECC increases the GPGPU functional interruption rate. Finally, the performance and reliability of ECC are compared with those of Algorithm-Based Fault Tolerance and Duplication With Comparison strategies.
  • Keywords
    electronic engineering computing; error correction codes; fault tolerant computing; ECC efficiency; GPGPU; algorithm-based fault tolerance; error-correcting code mechanism; functional interruption rate; high performance computing; radiation-induced silent data corruption; safety-critical application; Benchmark testing; Error correction codes; Graphics processing units; Instruction sets; Interrupters; Neutrons; Reliability; ABFT; ECC; GPGPU; duplication with comparison; functional interruption; silent data corruption;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT), 2014 IEEE International Symposium on
  • Conference_Location
    Amsterdam
  • Print_ISBN
    978-1-4799-6154-2
  • Type
    conf
  • DOI
    10.1109/DFT.2014.6962085
  • Filename
    6962085