DocumentCode
3588658
Title
Extending checksum-based ABFT to tolerate soft errors online in iterative methods
Author
Longxiang Chen ; Dingwen Tao ; Panruo Wu ; Zizhong Chen
Author_Institution
Univ. of California, Riverside, Riverside, CA, USA
fYear
2014
Firstpage
344
Lastpage
351
Abstract
As the size and complexity of high-performance computers increase, more soft errors will be encountered during computations. Algorithm-Based Fault Tolerance (ABFT) has been shown to be a highly efficient technique for detecting soft errors in dense linear algebra operations, including matrix multiplication and Cholesky and LU factorization. While ABFT can also be applied to an iterative sparse linear algebra algorithm by applying it to every individual matrix-vector multiplication in the algorithm, doing so often introduces considerable overhead. In this paper, we propose novel extensions to ABFT that not only reduce the overhead but also protect computations that cannot be protected by existing ABFT. Instead of maintaining checksums in every individual matrix-vector multiplication, we modify the algorithms so that checksums established at the beginning of an algorithm are maintained at every iteration throughout it. Because soft errors in most iterative sparse linear algebra algorithms propagate from one iteration to the next, we do not have to verify the correctness of the checksums at each iteration to detect errors. By reducing the frequency of verification, the fault tolerance overhead can be greatly reduced. Experimental results demonstrate that, when used together with local diskless checkpoints, our approach introduces much less overhead than existing ABFT techniques.
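The idea of carrying a checksum across iterations and verifying it only occasionally can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes a plain Richardson iteration x <- x + omega * (b - A x), an all-ones checksum vector, and a hypothetical function name and parameters (`richardson_with_checksum`, `check_every`, `fault_at`, `tol`) chosen only for illustration. The checksum of x is updated from the column sums of A rather than recomputed from x, so an injected error keeps the two values apart until the next verification.

```python
def matvec(A, x):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def richardson_with_checksum(A, b, omega, iters, check_every,
                             fault_at=None, tol=1e-9):
    """Richardson iteration that carries a checksum of x between iterations.

    Illustrative assumption, not the method from the paper.
    Returns (x, k), where k is the iteration at which a checksum
    mismatch was detected, or None if no mismatch was seen.
    """
    n = len(b)
    # Checksum row c^T A with c = (1, ..., 1): the column sums of A.
    col_sums = [sum(A[i][j] for i in range(n)) for j in range(n)]
    sum_b = sum(b)

    x = [0.0] * n
    s_x = 0.0  # maintained checksum; should equal sum(x) when fault-free

    for k in range(1, iters + 1):
        y = matvec(A, x)
        if k == fault_at:
            y[0] += 1.0  # simulated soft error in the matvec result

        # Predicted checksum of y = A x, computed without reading y.
        s_y = sum(c * xi for c, xi in zip(col_sums, x))

        # Update the vector and its checksum with the same linear relation.
        x = [xi + omega * (bi - yi) for xi, bi, yi in zip(x, b, y)]
        s_x = s_x + omega * (sum_b - s_y)

        # Verify only every `check_every` iterations.
        if k % check_every == 0:
            if abs(sum(x) - s_x) > tol * max(1.0, abs(s_x)):
                return x, k
    return x, None
```

In this sketch an error injected at one iteration shifts sum(x) away from the maintained checksum by a fixed amount that survives later updates, which is why the check can be deferred. Recovery (for example, rolling back to a checkpoint) is left out.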
Keywords
checkpointing; iterative methods; matrix decomposition; matrix multiplication; parallel processing; software fault tolerance; sparse matrices; Cholesky factorization; LU factorization; algorithm-based fault tolerance; checksum-based ABFT extension; diskless checkpoints; fault tolerance overhead reduction; iterative sparse linear algebra algorithm; matrix-vector multiplication; online soft error tolerance; soft error detection; Checkpointing; Fault detection; Fault tolerance; Fault tolerant systems; Iterative methods; Sparse matrices; Symmetric matrices; ABFT; Diskless Checkpoint; Error Detection; Iterative Methods; Soft Errors;
fLanguage
English
Publisher
IEEE
Conference_Titel
Parallel and Distributed Systems (ICPADS), 2014 20th IEEE International Conference on
Type
conf
DOI
10.1109/PADSW.2014.7097827
Filename
7097827
Link To Document