Regional Information Center for Science and Technology - Mantissa-preserving operations and robust algorithm based fault tolerance for matrix computations

DocumentCode :

898830

Title :

Mantissa-preserving operations and robust algorithm based fault tolerance for matrix computations

Author :

Dutt, Shantanu ; Assaad, Fikri T.

Author_Institution :

Dept. of Electr. Eng., Minnesota Univ., Minneapolis, MN, USA

Volume :

Issue :

fYear :

1996

fDate :

4/1/1996

Firstpage :

408

Lastpage :

424

Abstract :

A system-level method for achieving fault tolerance called algorithm-based fault tolerance (ABFT) has been proposed by a number of researchers. Many ABFT schemes use a floating-point checksum test to detect computation errors resulting from hardware faults. This makes the tests susceptible to roundoff inaccuracies in floating-point operations, which either cause false alarms or lead to undetected errors. Thresholding of the equality test has been commonly used to avoid false alarms; however, a good threshold that minimizes false alarms without reducing the error coverage significantly is difficult to find, especially when not much is known about the input data. Furthermore, thresholded checksums will inevitably miss lower-bit errors, which can get magnified as a computation such as LU decomposition progresses. We develop a theory for applying integer mantissa checksum tests to “mantissa-preserving” floating-point computations. This test is not susceptible to roundoff problems and yields 100% error coverage without false alarms. For computations that are not fully mantissa-preserving, we show how to apply the mantissa checksum test to the mantissa-preserving components of the computation and the floating-point test to the rest of the computation. We apply this general methodology to matrix-matrix multiplication and LU decomposition (using the Gaussian elimination (GE) algorithm), and find that the accuracy of this new “hybrid” testing scheme is substantially higher than that of the floating-point test with thresholding.
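
The thresholded floating-point checksum test discussed in the abstract can be sketched as follows. This is a minimal illustration of a column-checksum test for matrix-matrix multiplication, not the authors' mantissa-based or hybrid scheme. The function names, the 8x8 matrix size, the injected error magnitudes, and the threshold of 1e-9 are all assumptions chosen for illustration.

```python
import random


def matmul(a, b):
    """Plain triple-loop matrix product of two lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def column_sums(mat):
    """Checksum row: the sum of each column of mat."""
    return [sum(col) for col in zip(*mat)]


def checksum_residuals(a, b, c):
    """Per-column gap between (checksum row of A) * B and the checksum row of C.

    In exact arithmetic the two agree whenever C = A * B; in floating point
    they may differ slightly because of roundoff.
    """
    expected = matmul([column_sums(a)], b)[0]
    actual = column_sums(c)
    return [abs(e - s) for e, s in zip(expected, actual)]


def passes_checksum_test(a, b, c, threshold):
    """Thresholded equality test: accept C if every residual is within threshold."""
    return max(checksum_residuals(a, b, c)) <= threshold


# Hypothetical demo data (seeded so the run is repeatable).
random.seed(0)
n = 8
a = [[random.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
b = [[random.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
c = matmul(a, b)

# A large injected error and a low-order error in the same element.
c_big = [row[:] for row in c]
c_big[0][0] += 1.0
c_small = [row[:] for row in c]
c_small[0][0] += 1e-12

threshold = 1e-9  # assumed value; the abstract notes a good one is hard to pick
print("fault-free accepted:", passes_checksum_test(a, b, c, threshold))
print("large error accepted:", passes_checksum_test(a, b, c_big, threshold))
print("small error accepted:", passes_checksum_test(a, b, c_small, threshold))
```

With these assumed values the large error is rejected, while the low-order error falls below the threshold and is accepted. This is the kind of missed lower-bit error the abstract describes for thresholded checksums.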

Keywords :

error analysis; error detection; fault tolerant computing; floating point arithmetic; mathematics computing; matrix decomposition; matrix multiplication; roundoff errors; Gaussian elimination; LU decomposition; algorithm-based fault tolerance; computation error detection; error coverage; fault tolerance; floating-point checksum test; floating-point computation; floating-point operations; floating-point test; hardware faults; integer mantissa checksum tests; lower-bit errors; mantissa-preserving operations; matrix computations; robust algorithm; thresholded checksums; thresholding; undetected errors; Computer Society; Computer errors; Error correction; Fault detection; Fault tolerant systems; Robustness; System testing

fLanguage :

English

Journal_Title :

Computers, IEEE Transactions on

Publisher :

IEEE

ISSN :

0018-9340

Type :

jour

DOI :

10.1109/12.494099

Filename :

494099

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=898830