DocumentCode :
1918587
Title :
Poster: Programming Model Extensions for Resilience in Extreme Scale Computing
Author :
Hukerikar, Saurabh ; Diniz, Pedro C. ; Lucas, Robert F.
fYear :
2012
fDate :
10-16 Nov. 2012
Firstpage :
1434
Lastpage :
1434
Abstract :
System resilience is a key challenge to building extreme scale systems. A large number of HPC applications are inherently resilient, but application programmers lack mechanisms to convey their fault tolerance knowledge to the system. We present a cross-layer approach to resilience in which we propose a set of programming model extensions and develop a runtime inference framework that can reason about the context and significance of faults, as they occur, to the application programmer´s fault tolerance expectations. We demonstrate using a set accelerated fault injection experiments the validity of our approach with a set of real scientific and engineering codes. Our experiments show that a cross-layer approach that explicitly engages the programmer in expressing fault tolerance knowledge which is then leveraged across the layers of system abstraction can significantly improve the dependability of long running HPC applications.
fLanguage :
English
Publisher :
ieee
Conference_Titel :
High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion:
Conference_Location :
Salt Lake City, UT
Print_ISBN :
978-1-4673-6218-4
Type :
conf
DOI :
10.1109/SC.Companion.2012.239
Filename :
6496022
Link To Document :
بازگشت