DocumentCode
2013872
Title
Transient Fault Tolerance on Chip Multiprocessor Based on Dual and Triple Core Redundancy
Author
Gong, Rui ; Dai, Kui ; Wang, Zhiying
Author_Institution
Sch. of Comput., Nat. Univ. of Defense Technol., Changsha, China
fYear
2008
fDate
15-17 Dec. 2008
Firstpage
273
Lastpage
280
Abstract
To address the increasing susceptibility of microprocessors to transient faults, many techniques have been proposed that exploit the core redundancy of chip multiprocessors (CMPs). However, inter-core communication becomes critical in these core-redundancy-based techniques. To reduce the inter-core communication bandwidth demand, this paper proposes two new fault-tolerance approaches: dual core redundancy (DCR) and triple core redundancy (TCR). In DCR, only store instructions are compared before commit, which greatly reduces the bandwidth demand, and fault recovery is achieved by context saving and recovery. TCR applies triple modular redundancy (TMR) at the core level to efficiently exploit the core resources of CMPs for transient fault masking. In TCR, only the results of store instructions are compared, which detects transient faults while reducing the inter-core communication bandwidth demand. Once a single event upset (SEU) is detected, TCR can be reconfigured to continue execution on the two uncorrupted cores for fault detection. Experimental results demonstrate that, compared with the traditional transient fault recovery scheme CRTR, both DCR and TCR efficiently reduce inter-core bandwidth demand. DCR achieves transient fault recovery with a reasonable performance overhead caused by context saving. TCR occupies more core resources but has the lowest performance overhead during normal execution.
Keywords
fault tolerance; microprocessor chips; redundancy; Transient fault tolerance; chip multiprocessors; dual core redundancy; fault detection; intercore communications; triple core redundancy; triple modular redundancy; Bandwidth; Cathode ray tubes; Context; Event detection; Fault detection; Fault tolerance; Microprocessors; Redundancy; Single event upset; Yarn; Chip Multiprocessor; Dual Core Redundancy; Transient Fault Tolerance; Triple Core Redundancy;
fLanguage
English
Publisher
IEEE
Conference_Titel
Dependable Computing, 2008. PRDC '08. 14th IEEE Pacific Rim International Symposium on
Conference_Location
Taipei
Print_ISBN
978-0-7695-3448-0
Electronic_ISBN
978-0-7695-3448-0
Type
conf
DOI
10.1109/PRDC.2008.40
Filename
4725306