DocumentCode :
2246552
Title :
Fault tolerant matrix operations using checksum and reverse computation
Author :
Kim, Youngbae ; Plank, James S. ; Dongarra, Jack J.
Author_Institution :
Dept. of Comput. Sci., Tennessee Univ., Knoxville, TN, USA
fYear :
1996
fDate :
27-31 Oct 1996
Firstpage :
70
Lastpage :
77
Abstract :
In this paper, we present a technique, based on checksum and reverse computation, that enables high-performance matrix operations to be fault-tolerant with low overhead. We have implemented this technique on five matrix operations: matrix multiplication, Cholesky factorization, LU factorization, QR factorization and Hessenberg reduction. The overhead of checkpointing and recovery is analyzed both theoretically and experimentally. These analyses confirm that our technique can provide fault tolerance for these high-performance matrix operations with low overhead.
Keywords :
digital arithmetic; fault tolerant computing; matrix algebra; matrix multiplication; roundoff errors; Cholesky factorization; Hessenberg reduction; LU factorization; QR factorization; checkpointing; checksum; fault tolerant matrix operations; high-performance matrix operations; recovery; reverse computation; Availability; Checkpointing; Computer science; Fault tolerance; High performance computing; Lifting equipment; Linear programming; Performance analysis; Roundoff errors; Workstations
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Frontiers of Massively Parallel Computing, 1996. Proceedings Frontiers '96., Sixth Symposium on the
Conference_Location :
Annapolis, MD
ISSN :
1088-4955
Print_ISBN :
0-8186-7551-9
Type :
conf
DOI :
10.1109/FMPC.1996.558063
Filename :
558063