Title :
Reliability Enhancement of Fault-prone Many-core Systems Combining Spatial and Temporal Redundancy
Author_Institution :
Univ. of Wuerzburg, Wuerzburg, Germany
Abstract :
Increasing transistor integration capacity will put hundreds of processors on a single chip, and it will also make these systems inherently susceptible to errors. Various redundancy techniques can be applied to restore reliability, but they involve significant overhead, so identifying the optimal degree of redundancy is an important objective. In this paper we focus on core-level redundancy and checkpointing rollback-recovery. We introduce a model that determines the degree of spatial and temporal redundancy that minimizes the expected execution time. We further show that in several cases the minimal expected execution time is achieved only by combining both techniques, spatial and temporal redundancy, simultaneously.
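The trade-off described in the abstract can be illustrated with a small numerical sketch. This is not the model from the paper, which the abstract does not specify. It is a simplified stand-in under loudly stated assumptions: work is perfectly parallel across a fixed number of cores, running k replicas multiplies the work per core by k, each replica fails during a segment with a constant exponential rate `lam`, a segment succeeds when a majority of replicas are fault-free, and a failed segment is detected at its end and repeated after a rollback. All function and parameter names are hypothetical.

```python
import math
from math import comb


def segment_success_prob(k, lam, t):
    """Probability that a majority of k replicas survive a segment of length t.

    Assumption: each replica fails independently with exponential rate lam.
    """
    p = math.exp(-lam * t)   # one replica is fault-free for the whole segment
    need = k // 2 + 1        # size of a majority (k=1 -> 1, k=3 -> 2, ...)
    return sum(comb(k, i) * p**i * (1 - p) ** (k - i) for i in range(need, k + 1))


def expected_time(work, cores, k, n, lam, ckpt, rollback):
    """Expected execution time with k replicas (spatial) and n checkpoints (temporal).

    Assumption: a failed segment is noticed at its end and fully repeated,
    so the number of attempts per segment is geometric with mean 1/P.
    """
    t = work * k / (cores * n) + ckpt   # segment length incl. checkpoint cost
    P = segment_success_prob(k, lam, t)
    return n * (t / P + rollback * (1 / P - 1))


def best_config(work, cores, lam, ckpt, rollback, ks=(1, 3, 5), max_n=200):
    """Exhaustively search replica count k and checkpoint count n.

    Returns a tuple (expected_time, k, n) with the smallest expected time.
    """
    return min(
        (expected_time(work, cores, k, n, lam, ckpt, rollback), k, n)
        for k in ks
        for n in range(1, max_n + 1)
    )


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to exercise the search.
    print(best_config(work=1000.0, cores=100, lam=0.05, ckpt=0.5, rollback=0.5))
```

In this sketch, raising k lengthens every segment but makes a repeat less likely, while raising n shortens segments at the price of more checkpoint overhead. Whether a pure or a combined setting wins depends on the assumed failure rate and costs.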
Keywords :
fault tolerant computing; multiprocessing systems; redundancy; reliability; checkpointing rollback-recovery; core-level redundancy; fault-prone many-core systems; minimal expected execution time; reliability enhancement; spatial redundancy; temporal redundancy; transistor integration capacity; Checkpointing; Context; Integrated circuit reliability; Runtime; Transistors; combined redundancy; dependability; fault-prone; many-core
Conference_Title :
2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems (HPCC-ICESS)
Conference_Location :
Liverpool
Print_ISBN :
978-1-4673-2164-8
DOI :
10.1109/HPCC.2012.233