Fault detection and tolerance mechanisms for future 1000 core systems

Author

Fechner, B. ; Garbade, A. ; Weis, Sebastian ; Ungerer, Theo

Author_Institution

Dept. of Comput. Sci., Univ. of Augsburg, Augsburg, Germany

fYear

2013

fDate

1-5 July 2013

Firstpage

552

Lastpage

554

Abstract

The enormous growth in integration density makes it possible to build processors with more and more cores on a single die, but it also makes them orders of magnitude more vulnerable to faults caused by voltage fluctuation, radiation, process variations [4], and similar effects. Since this trend will continue, fault-tolerance mechanisms must be an essential part of such future systems if computations are to be carried out reliably. Chip manufacturers have already taken measures to handle faults in current multi-core processors, such as error-correcting codes for components like buses and caches. With a huge number of cores, common strategies such as dual and triple modular redundant processing [5] become possible alongside massively parallel computing. Threaded dataflow execution models are one way to exploit the parallelism of future 1000-core systems, as current GPU architectures reflect [3]. The side-effect-free execution of threads within the dataflow execution model not only provides massively parallel computational capacity but also enables simple and efficient rollback mechanisms [16]. In this paper, we describe fault detection and tolerance mechanisms investigated within the TERAFLUX EC project [17], which offers a solution for exploiting the massive parallelism offered by dataflow architectures at all abstraction levels.
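
The abstract's point that side-effect-free threads make redundancy and rollback simple can be illustrated with a minimal sketch. It is not the TERAFLUX implementation: the names `run_dmr_with_rollback`, `tmr_vote`, and `make_flaky_adder` are hypothetical, and the fault injector only simulates a transient bit flip. The sketch assumes that a thread is a pure function of its inputs, so re-running it with the same inputs is a valid rollback.

```python
from collections import Counter


def run_dmr_with_rollback(thread_fn, inputs, max_retries=3):
    """Dual modular redundancy: run a side-effect-free thread twice and compare.

    Because the thread writes nothing outside its own result, "rollback"
    is simply discarding both results and re-executing with the same inputs.
    Returns (result, number_of_rollbacks).
    """
    for rollbacks in range(max_retries + 1):
        leading = thread_fn(*inputs)
        trailing = thread_fn(*inputs)
        if leading == trailing:
            return leading, rollbacks
    raise RuntimeError("results still disagree after all retries")


def tmr_vote(results):
    """Triple modular redundancy: return the value at least two of three runs agree on."""
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority among redundant results")
    return value


def make_flaky_adder(faulty_calls):
    """Hypothetical fault injector: an adder whose result has its lowest bit
    flipped on the call indices listed in faulty_calls (a simulated transient fault)."""
    state = {"calls": 0}

    def adder(a, b):
        index = state["calls"]
        state["calls"] += 1
        result = a + b
        if index in faulty_calls:
            result ^= 1  # simulated single-bit upset
        return result

    return adder


if __name__ == "__main__":
    adder = make_flaky_adder({0})  # only the very first execution is corrupted
    print(run_dmr_with_rollback(adder, (2, 3)))
    print(tmr_vote([5, 4, 5]))
```

In this sketch DMR detects the mismatch and recovers by re-execution, while TMR masks a single faulty result without re-execution, at the cost of a third run.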

Keywords

data flow computing; fault tolerant computing; multi-threading; multiprocessing systems; parallel architectures; system recovery; 1000-core systems; GPU architectures; TERAFLUX EC project; chip manufacturers; dataflow architectures; dual modular redundant processing; fault detection; fault handling; fault tolerance mechanism; massive parallel computational capacity; multicore processors; rollback mechanisms; side-effect free thread execution; threaded dataflow execution models; triple modular redundant processing; Computer architecture; Fault detection; Frequency modulation; Instruction sets; Message systems; Monitoring; Reliability;

fLanguage

English

Publisher

IEEE

Conference_Titel

High Performance Computing and Simulation (HPCS), 2013 International Conference on

Conference_Location

Helsinki

Print_ISBN

978-1-4799-0836-3

Type

conf

DOI

10.1109/HPCSim.2013.6641467

Filename

6641467