Title :
Efficacy and efficiency of algorithm-based fault-tolerance on GPUs
Author :
Wunderlich, H.-J.; Braun, Claus; Halder, Sebastian
Author_Institution :
Inst. of Comput. Archit. & Comput. Eng., Univ. of Stuttgart, Stuttgart, Germany
Abstract :
Computer simulations drive innovation in science and industry, and they are growing in importance. However, their high computational demand poses extraordinary challenges for computing systems. Typical high-performance computing systems, which provide sufficient performance and high reliability, are extremely expensive. Modern GPUs offer high performance at very low cost, and they enable simulation applications on the desktop. However, they are increasingly prone to transient effects and other reliability threats. To fulfill the strict reliability requirements of scientific computing and simulation technology, appropriate fault tolerance measures have to be integrated into simulation applications for GPUs. Algorithm-Based Fault Tolerance (ABFT) on GPUs has the potential to meet these requirements. In this work, we investigate the efficiency and the efficacy of ABFT for matrix operations on GPUs. We compare ABFT against fault tolerance schemes that are based on redundant computations, and we evaluate its error detection capabilities.
Keywords :
digital simulation; fault tolerant computing; graphics processing units; matrix algebra; parallel processing; reliability; ABFT; GPU; algorithm-based fault-tolerance; computational demand; computer simulations; desktop; error detection capabilities; high-performance computing systems; matrix operations; redundant computations; reliability requirements; reliability threats; scientific computing; Computational modeling; Fault tolerance; Fault tolerant systems; Instruction sets; Kernel; Tunneling magnetoresistance; Fault Simulation; GPGPU; Soft Errors
Conference_Title :
On-Line Testing Symposium (IOLTS), 2013 IEEE 19th International
Conference_Location :
Chania
DOI :
10.1109/IOLTS.2013.6604090