DocumentCode :
258569
Title :
On providing scalable self-healing adaptive fault-tolerance to RTR SoCs
Author :
Navas, Byron ; Oberg, Johnny ; Sander, Ingo
Author_Institution :
Dept. of Electron. Syst., KTH R. Inst. of Technol., Stockholm, Sweden
fYear :
2014
fDate :
8-10 Dec. 2014
Firstpage :
1
Lastpage :
6
Abstract :
The dependability of heterogeneous many-core FPGA-based systems is threatened by higher failure rates caused by disruptive scales of integration, increased design complexity, and radiation sensitivity. Triple-modular redundancy (TMR) and run-time reconfiguration (RTR) are traditional fault-tolerant (FT) techniques used to increase dependability. However, hardware redundancy is expensive, and most approaches have poor scalability, flexibility, and programmability. Therefore, innovative solutions are needed to reduce the redundancy cost while still preserving acceptable levels of dependability. In this context, this paper presents the implementation of a self-healing adaptive fault-tolerant SoC that reuses RTR IP-cores to self-assemble different TMR schemes at run-time. The presented system demonstrates the feasibility of the Upset-Fault-Observer concept, which provides a run-time self-test and recovery strategy that delivers fault tolerance for functions accelerated in RTR cores while reducing the redundancy scalability cost by running periodic reconfigurable TMR scan-cycles. In addition, this paper experimentally evaluates the trade-offs of the implemented reconfigurable TMR schemes by characterizing important fault-tolerance metrics, i.e., recovery time (self-repair and self-replicate), detection latency, self-assembly latency, throughput reduction, and increase in physical resources.
Keywords :
fault tolerant computing; field programmable gate arrays; system-on-chip; RTR IP-cores; RTR SoCs; design complexity; failure rates; fault tolerant metrics; hardware redundancy; heterogeneous many-core FPGA based system dependability; periodic reconfigurable TMR scan-cycles; radiation sensitivity; recovery strategy; redundancy cost reduction; run-time reconfiguration; run-time self-test strategy; scalable self-healing adaptive fault-tolerance; self-healing adaptive fault-tolerant SoC; triple-modular redundancy; upset-fault-observer concept; Fault tolerant systems; Hardware; Redundancy; Software; System-on-chip; Tunneling magnetoresistance;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
ReConFigurable Computing and FPGAs (ReConFig), 2014 International Conference on
Conference_Location :
Cancun
Print_ISBN :
978-1-4799-5943-3
Type :
conf
DOI :
10.1109/ReConFig.2014.7032541
Filename :
7032541