Title :
Soft error propagation in floating-point programs
Author :
Li, Sha ; Li, Xiaoming
Author_Institution :
Dept. of Electr. & Comput. Eng., Univ. of Delaware, Newark, DE, USA
Abstract :
As technology scales, VLSI performance has grown exponentially. As feature sizes shrink, however, new challenges such as soft errors (single-event upsets) threaten circuit reliability. Recent studies have addressed soft errors with error detection and correction techniques such as error-correcting codes and redundant execution. However, these techniques come at the cost of additional storage or lower performance. In this paper, we present a different approach. We start by building a quantitative understanding of error propagation in software and propose a systematic evaluation of the impact of bit flips caused by soft errors on floating-point operations. Furthermore, we introduce a novel model for dealing with soft errors. More specifically, we assume that soft errors have occurred in memory and determine how they manifest in program results. Some soft errors can then be tolerated if the resulting error is smaller than the intrinsic inaccuracy of floating-point representations or falls within a predefined range. We focus on analyzing error propagation for floating-point arithmetic operations. Our approach is motivated by interval analysis. We model the rounding effect of floating-point numbers, which enables us to simulate and predict the error propagation of individual floating-point arithmetic operations for specific soft errors. In other words, we model and simulate the relation between the bit flip rate, which is determined by soft errors in hardware, and the error of floating-point arithmetic operations. The simulation results enable us to tolerate certain types of soft errors without expensive error detection and correction processing.
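The idea in the abstract, a bit flip in a stored operand followed by a check of whether the result's error stays within a tolerance, can be illustrated with a minimal sketch. This is not the paper's model: it does no interval analysis and does not model bit flip rates. It assumes IEEE 754 binary64 operands, and the names `flip_bit`, `relative_error`, `tolerable_add` and the `tolerance` parameter are hypothetical, introduced only for illustration.

```python
import math
import struct


def flip_bit(x: float, bit: int) -> float:
    """Return x with one bit of its IEEE 754 binary64 encoding flipped.

    Bit 0 is the least significant mantissa bit, bits 52..62 hold the
    exponent, and bit 63 is the sign. (Hypothetical helper.)
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range 0..63")
    # Reinterpret the double as a 64-bit unsigned integer.
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    # Flip the chosen bit and reinterpret the pattern as a double.
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return flipped


def relative_error(exact: float, observed: float) -> float:
    """Relative error of observed with respect to exact."""
    if math.isnan(exact) or math.isnan(observed):
        return math.inf  # treat NaN as an unbounded error
    if exact == observed:
        return 0.0
    if exact == 0.0:
        return math.inf
    return abs(observed - exact) / abs(exact)


def tolerable_add(a: float, b: float, bit: int, tolerance: float):
    """Inject a single bit flip into operand a, compute a + b, and report
    the relative error of the sum and whether it is within tolerance.

    The tolerance is an assumption of this sketch: it stands in for the
    "predefined range" mentioned in the abstract.
    """
    exact = a + b
    faulty = flip_bit(a, bit) + b
    err = relative_error(exact, faulty)
    return err, err <= tolerance


if __name__ == "__main__":
    # Sweep every bit position of one operand and list the flips whose
    # effect on the sum stays within the chosen tolerance.
    a, b, tol = 3.75, 1.25, 1e-9
    ok = [bit for bit in range(64) if tolerable_add(a, b, bit, tol)[1]]
    print("tolerable bit positions:", ok)
```

In this sketch, flips in the low mantissa bits tend to produce small relative errors, while flips in the exponent or sign bits change the result drastically, which is the kind of distinction that would let some soft errors go uncorrected.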
Keywords :
error correction codes; floating point arithmetic; software engineering; VLSI performance; circuit reliability; error correcting codes; floating-point arithmetic operation; floating-point representations; redundant execution technique; soft error propagation; Analytical models; Computational modeling; Computers; Error correction codes; Error probability; Fuses; Predictive models
Conference_Title :
Performance Computing and Communications Conference (IPCCC), 2010 IEEE 29th International
Conference_Location :
Albuquerque, NM
Print_ISBN :
978-1-4244-9330-2
DOI :
10.1109/PCCC.2010.5682305