Title :
Opportunistic application-level fault detection through adaptive redundant multithreading
Author :
Hukerikar, Saurabh ; Diniz, Pedro C. ; Lucas, Robert F. ; Teranishi, K.
Author_Institution :
Inf. Sci. Inst., Univ. of Southern California, Marina del Rey, CA, USA
Abstract :
As the scale and complexity of future High Performance Computing systems continues to grow, the rising frequency of faults and errors and their impact on HPC applications will make it increasingly difficult to accomplish useful computation. Traditional means of fault detection and correction are either hardware based or use software based redundancy. Redundancy based approaches usually entail complete replication of the program state or the computation and therefore incurs substantial overhead to application performance. Therefore, the wide-scale use of full redundancy in future exascale class systems is not a viable solution for error detection and correction. In this paper we present an application level fault detection approach that is based on adaptive redundant multithreading. Through a language level directive, the programmer can define structured code blocks. When these blocks are executed by multiple threads and their outputs compared, we can detect errors in specific parts of the program state that will ultimately determine the correctness of the application outcome. The compiler outlines such code blocks and a runtime system reasons whether their execution by redundant threads should enabled/disabled by continuously observing and learning about the fault tolerance state of the system. By providing flexible building blocks for application specific fault detection, our approach makes possible more reasonable performance overheads than full redundancy. Our results show that the overheads to application performance are in the range of 4% to 70% due to runtime system being continuously aware of the rate and source of system faults, rather than the usual overhead in the excess of 100% that is incurred by complete replication.
Keywords :
fault tolerant computing; multi-threading; parallel processing; HPC applications; adaptive redundant multithreading; error correction; error detection; fault correction; future exascale class systems; hardware based redundancy; high performance computing systems; language level directive; opportunistic application-level fault detection; redundancy based approach; software based redundancy; structured code blocks; Fault detection; Hardware; Instruction sets; Multithreading; Redundancy; Runtime;
Conference_Titel :
High Performance Computing & Simulation (HPCS), 2014 International Conference on
Conference_Location :
Bologna
Print_ISBN :
978-1-4799-5312-7
DOI :
10.1109/HPCSim.2014.6903692