Title :
Device View Redundancy: An Adaptive Low-Overhead Fault Tolerance Mechanism for Many-Core System
Author :
Wentao Jia ; Chunyuan Zhang ; Jian Fu
Author_Institution :
Nat. Key Lab. of Parallel & Distrib. Process., Nat. Univ. of Defense Technol., Changsha, China
Abstract :
The continued increase in fault rates in integrated circuits makes processors, especially many-core processors, more susceptible to errors. Meanwhile, most systems and applications do not need full fault coverage, which incurs excessive overhead, so on-demand fault tolerance is desirable for these applications. In this paper, we propose an adaptive low-overhead fault tolerance mechanism for many-core systems, called Device View Redundancy (DVR). It treats fault tolerance as a device that an application can configure and use when high reliability is needed. Moreover, DVR exploits idle resources to achieve low-overhead fault tolerance, based on the observation that the utilization of many-core systems is low due to a lack of parallelism in applications. Experiments show that DVR reduces performance overhead by 16% to 98% compared with full Dual Modular Redundancy (DMR).
Keywords :
fault tolerant computing; multiprocessing systems; DVR; adaptive low-overhead fault tolerance mechanism; device view redundancy; many-core processor; many-core system; on-demand fault tolerance; Fault tolerant systems; Hardware; Performance evaluation; Redundancy; Registers; dynamic coupling; idle resource exploitation; low-overhead; many core system; on-demand redundancy;
Conference_Title :
High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (HPCC_EUC), 2013 IEEE 10th International Conference on
Conference_Location :
Zhangjiajie
DOI :
10.1109/HPCC.and.EUC.2013.299