• DocumentCode
    3200318
  • Title
    Design for a Soft Error Resilient Dynamic Task-Based Runtime
  • Author
    Cao, Chongxiao ; Herault, Thomas ; Bosilca, George ; Dongarra, Jack
  • Author_Institution
    Univ. of Tennessee, Knoxville, TN, USA
  • fYear
    2015
  • fDate
    25-29 May 2015
  • Firstpage
    765
  • Lastpage
    774
  • Abstract
    As the scale of modern computing systems grows, failures will happen more frequently. On the way to exascale, a generic, low-overhead, resilient extension becomes a desired aptitude of any programming paradigm. In this paper we explore three additions to a dynamic task-based runtime to build a generic framework providing soft error resilience to task-based programming paradigms. The first recovers the application by re-executing the minimum required sub-DAG, the second takes critical checkpoints of the data flowing between tasks to minimize the necessary re-execution, while the last one takes advantage of algorithmic properties to recover the data without re-execution. These mechanisms have been implemented in the PaRSEC task-based runtime framework. Experimental results validate our approach and quantify the overhead introduced by such mechanisms.
  • Keywords
    directed graphs; fault tolerant computing; parallel processing; PaRSEC task-based runtime framework; algorithmic property; generic framework; modern computing system; resilient extension; soft error resilience; soft error resilient dynamic task-based runtime; sub-DAG; task-based programming paradigm; Algorithm design and analysis; Dynamic scheduling; Fault tolerance; Fault tolerant systems; Heuristic algorithms; Runtime; Symmetric matrices; fault tolerance; runtime; soft error resilience
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International
  • Conference_Location
    Hyderabad
  • ISSN
    1530-2075
  • Type
    conf
  • DOI
    10.1109/IPDPS.2015.81
  • Filename
    7161563