• DocumentCode
    1815133
  • Title

    Fault detection and tolerance mechanisms for future 1000 core systems

  • Author

    Fechner, B. ; Garbade, A. ; Weis, Sebastian ; Ungerer, Theo

  • Author_Institution
    Dept. of Comput. Sci., Univ. of Augsburg, Augsburg, Germany
  • fYear
    2013
  • fDate
    1-5 July 2013
  • Firstpage
    552
  • Lastpage
    554
  • Abstract
    The enormous growth in integration density makes it possible to build processors with more and more cores on a single die, but it also makes them orders of magnitude more vulnerable to faults caused by voltage fluctuations, radiation, process variations [4], etc. Since this trend will continue, fault-tolerance mechanisms must be an essential part of such future systems if computations are to be carried out reliably. Chip manufacturers have already taken measures to handle faults in current multi-core processors, such as error-correcting codes for buses, caches, etc. With a huge number of cores, common strategies such as dual and triple modular redundant processing [5] become possible alongside massively parallel computing. Threaded dataflow execution models are one way to exploit the parallelism of future 1000-core systems, and current GPU architectures reflect that [3]. The side-effect-free execution of threads within the dataflow execution model not only provides massively parallel computational capacity but also enables simple and efficient rollback mechanisms [16]. In this paper, we describe fault detection and tolerance mechanisms investigated within the TERAFLUX EC project [17], which offers a solution to exploit the massive parallelism offered by dataflow architectures at all abstraction levels.
  • Keywords
    data flow computing; fault tolerant computing; multi-threading; multiprocessing systems; parallel architectures; system recovery; 1000-core systems; GPU architectures; TERAFLUX EC project; chip manufacturers; dataflow architectures; dual modular redundant processing; fault detection; fault handling; fault tolerance mechanism; massive parallel computational capacity; multicore processors; rollback mechanisms; side-effect free thread execution; threaded dataflow execution models; triple modular redundant processing; Computer architecture; Fault detection; Frequency modulation; Instruction sets; Message systems; Monitoring; Reliability;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    High Performance Computing and Simulation (HPCS), 2013 International Conference on
  • Conference_Location
    Helsinki
  • Print_ISBN
    978-1-4799-0836-3
  • Type

    conf

  • DOI
    10.1109/HPCSim.2013.6641467
  • Filename
    6641467