DocumentCode :
3429799
Title :
Classifying soft error vulnerabilities in extreme-scale scientific applications using a binary instrumentation tool
Author :
Li, Dong ; Vetter, Jeffrey S. ; Yu, Weikuan
fYear :
2012
fDate :
10-16 Nov. 2012
Firstpage :
1
Lastpage :
11
Abstract :
Extreme-scale scientific applications are at significant risk of being hit by soft errors on supercomputers as the scale of these systems and their component density continue to increase. To better understand the specific soft error vulnerabilities in scientific applications, we have built an empirical fault injection and consequence analysis tool, BIFIT, that allows us to evaluate how soft errors impact applications. In particular, BIFIT is designed to inject faults at very specific targets: an arbitrarily chosen execution point and any specific data structure. We apply BIFIT to three mission-critical scientific applications and investigate the applications' vulnerability to soft errors by performing thousands of statistical tests. We then classify each application's individual data structures based on their sensitivity to these vulnerabilities, and generalize these classifications across applications. Subsequently, these classifications can be used to apply appropriate resiliency solutions to each data structure within an application. Our study reveals that these scientific applications have a wide range of sensitivities to both the time and the location of a soft error; yet we are able to identify intrinsic relationships between application vulnerabilities and specific types of data objects. In this regard, BIFIT enables new opportunities for future resiliency research.
Keywords :
fault diagnosis; parallel machines; scientific information systems; statistical testing; BIFIT; binary instrumentation tool; component density; consequence analysis tool; data structure; empirical fault injection; extreme-scale scientific application; mission-critical scientific application; soft error vulnerabilities; statistical test; supercomputer; Algorithm design and analysis; Data structures; Hardware; Instruments; Libraries; Object recognition; Resilience;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for
Conference_Location :
Salt Lake City, UT
ISSN :
2167-4329
Print_ISBN :
978-1-4673-0805-2
Type :
conf
DOI :
10.1109/SC.2012.29
Filename :
6468536