• DocumentCode
    3588658
  • Title
    Extending checksum-based ABFT to tolerate soft errors online in iterative methods
  • Author
    Longxiang Chen ; Dingwen Tao ; Panruo Wu ; Zizhong Chen
  • Author_Institution
    Univ. of California, Riverside, Riverside, CA, USA
  • fYear
    2014
  • Firstpage
    344
  • Lastpage
    351
  • Abstract
    As the size and complexity of high-performance computers increase, more soft errors will be encountered during computations. Algorithm-Based Fault Tolerance (ABFT) has been proven to be a highly efficient technique for detecting soft errors in dense linear algebra operations, including matrix multiplication and Cholesky and LU factorization. While ABFT can also be applied to an iterative sparse linear algebra algorithm by applying it to every individual matrix-vector multiplication in the algorithm, doing so often introduces considerable overhead. In this paper, we propose novel extensions to ABFT that not only reduce the overhead but also protect computations that cannot be protected by existing ABFT. Instead of maintaining checksums in every individual matrix-vector multiplication, we modify the algorithms so that checksums established at the beginning of the algorithms are maintained at every iteration throughout the algorithms. Because soft errors in most iterative sparse linear algebra algorithms propagate from one iteration to the next, we do not have to verify the correctness of the checksums at every iteration to detect errors. By reducing the frequency of verification, the fault tolerance overhead can be greatly reduced. Experimental results demonstrate that, when used together with local diskless checkpoints, our approach introduces much less overhead than existing ABFT techniques.
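    The idea of carrying one checksum through the iterations and verifying it only occasionally can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes a simple Richardson iteration x <- x + omega * (b - A x) on a small dense matrix, and the function names, the `fault_at` injection hook, and the tolerance are all hypothetical choices made for illustration. The checksum s = sum(x) is set once and updated with the column sums of A, so it never depends on the possibly corrupted product A x.

```python
def matvec(A, x):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def richardson_with_checksum(A, b, omega, iters, check_every,
                             fault_at=None, tol=1e-8):
    """Richardson iteration with a checksum carried across iterations.

    Illustrative sketch only (assumed setup, not the paper's method).
    Returns (x, k), where k is the iteration at which a checksum
    mismatch was detected, or None if no mismatch was seen.
    """
    n = len(b)
    # Column sums of A, i.e. 1^T A, computed once before iterating.
    col_sums = [sum(A[i][j] for i in range(n)) for j in range(n)]
    sum_b = sum(b)
    x = [0.0] * n
    s = 0.0  # checksum of x, established at the start

    for k in range(1, iters + 1):
        y = matvec(A, x)
        if fault_at == k:
            y[0] += 1.0  # simulated soft error in the matvec result

        # Update the checksum from the recurrence, using 1^T A x
        # instead of sum(y), so it is independent of the faulty y.
        s += omega * (sum_b - sum(c * xi for c, xi in zip(col_sums, x)))
        x = [xi + omega * (bi - yi) for xi, bi, yi in zip(x, b, y)]

        # Verify only every `check_every` iterations; a mismatch created
        # earlier is still present because the error propagates.
        if k % check_every == 0 and abs(sum(x) - s) > tol * max(1.0, abs(s)):
            return x, k

    return x, None
```

    In this sketch, a fault injected at iteration 3 with `check_every=5` is still caught by the check at iteration 5, because the gap between sum(x) and the maintained checksum persists from one iteration to the next. A full scheme would then roll back to a checkpoint, which is omitted here.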
  • Keywords
    checkpointing; iterative methods; matrix decomposition; matrix multiplication; parallel processing; software fault tolerance; sparse matrices; Cholesky factorization; LU factorization; algorithm-based fault tolerance; checksum-based ABFT extension; diskless checkpoints; fault tolerance overhead reduction; iterative sparse linear algebra algorithm; matrix-vector multiplication; online soft error tolerance; soft error detection; Checkpointing; Fault detection; Fault tolerance; Fault tolerant systems; Iterative methods; Sparse matrices; Symmetric matrices; ABFT; Diskless Checkpoint; Error Detection; Iterative Methods; Soft Errors;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Systems (ICPADS), 2014 20th IEEE International Conference on
  • Type
    conf
  • DOI
    10.1109/PADSW.2014.7097827
  • Filename
    7097827