DocumentCode
159484
Title
GPGPUs ECC efficiency and efficacy
Author
Oliveira, Daniel A. G. ; Rech, P. ; Pilla, Laercio L. ; Navaux, Philippe Olivier Alexandre ; Carro, Luigi
Author_Institution
Inst. of Inf., Fed. Univ. of Rio Grande do Sul, Porto Alegre, Brazil
fYear
2014
fDate
1-3 Oct. 2014
Firstpage
209
Lastpage
215
Abstract
In this paper we assess and discuss the efficiency and overhead of the Error-Correcting Code (ECC) mechanism available on modern GPGPUs, which are increasingly used for both High Performance Computing and safety-critical applications. Resilience to both radiation-induced silent data corruption and functional interruption is addressed experimentally and analytically. The experimental analysis demonstrates that ECC significantly reduces the occurrence of silent data corruption but may not be sufficient to guarantee high reliability. Moreover, ECC increases the GPGPU functional interruption rate. Finally, the performance and reliability of ECC are compared with those of Algorithm-Based Fault Tolerance and Duplication With Comparison strategies.
Keywords
electronic engineering computing; error correction codes; fault tolerant computing; ECC efficiency; GPGPU; algorithm-based fault tolerance; error-correcting code mechanism; functional interruption rate; high performance computing; radiation-induced silent data corruption; safety-critical application; Benchmark testing; Error correction codes; Graphics processing units; Instruction sets; Interrupters; Neutrons; Reliability; ABFT; ECC; GPGPU; duplication with comparison; functional interruption; silent data corruption;
fLanguage
English
Publisher
IEEE
Conference_Titel
Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT), 2014 IEEE International Symposium on
Conference_Location
Amsterdam
Print_ISBN
978-1-4799-6154-2
Type
conf
DOI
10.1109/DFT.2014.6962085
Filename
6962085