Title :
A Fault Detection and Recovery Architecture for a Teradevice Dataflow System
Author :
Weis, Sebastian ; Garbade, Arne ; Wolf, Julian ; Fechner, Bernhard ; Mendelson, Avi ; Giorgi, Roberto ; Ungerer, Theo
Author_Institution :
Univ. of Augsburg, Augsburg, Germany
Abstract :
Future computing systems (Teradevices) will probably contain more than 1000 cores on a single die. To exploit this parallelism, threaded dataflow execution models are promising, since they provide side-effect-free execution and reduced synchronization overhead. However, the terascale transistor integration of such chips makes them orders of magnitude more vulnerable to voltage fluctuations, radiation, and process variations. Reliability techniques therefore have to be an essential part of such future systems. In this paper, we conceptualize a fault-tolerant architecture for a scalable threaded dataflow system. We provide methods to detect permanent, intermittent, and transient faults during execution. Furthermore, we propose a recovery technique for dataflow threads.
Keywords :
data flow computing; fault tolerant computing; multi-threading; parallel architectures; Teradevice dataflow system; fault recovery architecture; fault tolerant architecture; intermittent fault detection; permanent fault detection; scalable threaded dataflow system; side-effect free execution; terascale transistor integration; threaded dataflow execution model; transient fault detection; Computer architecture; Fault detection; Fault tolerance; Fault tolerant systems; Hardware; Instruction sets; Message systems; dataflow; fault tolerance; many-core; reliability; teradevice;
Conference_Title :
Data-Flow Execution Models for Extreme Scale Computing (DFM), 2011 First Workshop on
Conference_Location :
Galveston Island, TX
Print_ISBN :
978-1-4673-0709-3