DocumentCode :
3687113
Title :
Enabling application resilience through programming model based fault amelioration
Author :
Saurabh Hukerikar;Pedro C. Diniz;Robert F. Lucas
Author_Institution :
Information Sciences Institute, University of Southern California, Marina del Rey, USA
fYear :
2015
Firstpage :
1
Lastpage :
6
Abstract :
High-performance computing applications that will run on future exascale-class supercomputing systems are projected to encounter accelerated rates of faults and errors. For these large-scale systems, maintaining fault-resilient operation is a key challenge. The most widely used resiliency approach today, which is based on checkpoint and rollback (C/R) recovery, is not expected to remain viable in the presence of frequent errors and failures. In this paper, we present a framework for enabling application-level recovery from error states through fault amelioration. Our approach is based on programming model extensions that enable algorithm-based fault amelioration knowledge to be expressed as an intrinsic feature of the programming environment. This is accomplished through a set of language extensions that are supported by a compiler infrastructure and a runtime system. We experimentally demonstrate that the framework enables recovery from errors in the program state with low overhead on application performance.
Keywords :
"Runtime","Resilience","Programming","Semantics","Data structures","Program processors","Syntactics"
Publisher :
IEEE
Conference_Title :
High Performance Extreme Computing Conference (HPEC), 2015 IEEE
Type :
conf
DOI :
10.1109/HPEC.2015.7322460
Filename :
7322460