DocumentCode
3696960
Title
Exploiting Spatial Smoothness in HPC Applications to Detect Silent Data Corruption
Author
Leonardo Bautista-Gomez;Franck Cappello
fYear
2015
Firstpage
128
Lastpage
133
Abstract
Next-generation supercomputers are expected to have more components and, at the same time, consume several times less energy per operation. This situation is pushing supercomputer constructors to the limits of miniaturization and energy-saving strategies. Consequently, the number of soft errors is expected to increase dramatically in the coming years. While mechanisms are in place to correct or at least detect soft errors, a percentage of those errors pass unnoticed by the system. Such silent errors are extremely damaging because they can make applications produce wrong results. In this paper we propose a technique that leverages certain properties of HPC applications in order to detect silent errors at the application level. Our technique detects corruption solely based on the data behavior and is algorithm-agnostic. We show that this strategy can detect up to 90% of injected errors in some regions while incurring less than 1% overhead.
Keywords
"Detectors","Supercomputers","Random access memory","Error correction codes","Entropy","Reliability","Registers"
Publisher
ieee
Conference_Titel
High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), 2015 IEEE 17th International Conference on
Type
conf
DOI
10.1109/HPCC-CSS-ICESS.2015.9
Filename
7336154