• DocumentCode
    2933342
  • Title
    A programming model for resilience in extreme scale computing
  • Author
    Hukerikar, Saurabh; Diniz, Pedro C.; Lucas, Robert F.
  • Author_Institution
    Inf. Sci. Inst., Univ. of Southern California, Marina del Rey, CA, USA
  • fYear
    2012
  • fDate
    25-28 June 2012
  • Firstpage
    1
  • Lastpage
    6
  • Abstract
    System resilience is an important challenge that needs to be addressed in the era of extreme scale computing. Exascale supercomputers will be architected using millions of processor cores and memory modules. As process technology scales, the reliability of such systems will be challenged by the inherent unreliability of individual components due to extremely small transistor geometries, variability in silicon manufacturing processes, device aging, etc. Therefore, errors and failures in extreme scale systems will increasingly be the norm rather than the exception. Not all errors detected warrant catastrophic system failure, but there are presently no mechanisms for the programmer to communicate their knowledge of algorithmic fault tolerance to the system. We present a programming model approach for system resilience that allows programmers to explicitly express their fault tolerance knowledge. We propose novel resilience-oriented programming model extensions and programming directives, and illustrate their effectiveness. An inference engine leverages this information and combines it with runtime-gathered context to increase the dependability of HPC systems.
  • Keywords
    catastrophe theory; elemental semiconductors; fault tolerance; inference mechanisms; mainframes; manufacturing processes; memory architecture; parallel machines; transistors; HPC systems; algorithmic fault tolerance; catastrophic system failure; device aging; exascale supercomputers; extreme scale computing; fault tolerance knowledge; inference engine; memory modules; processor cores; programming directives; programming model approach; programming model extensions; resilience programming model; silicon manufacturing processes; small transistor geometries; Computational modeling; Context; Engines; Error correction codes; Programming; Resilience; Runtime; Exascale; Fault Tolerance; High-Performance Computing; Resilience;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Dependable Systems and Networks Workshops (DSN-W), 2012 IEEE/IFIP 42nd International Conference on
  • Conference_Location
    Boston, MA
  • Print_ISBN
    978-1-4673-2264-5
  • Electronic_ISBN
    978-1-4673-2265-2
  • Type
    conf
  • DOI
    10.1109/DSNW.2012.6264671
  • Filename
    6264671