Regional Information Center for Science and Technology (RICeST) - Mitigation of fail-stop failures in integer matrix products via numerical packing

DocumentCode :

734988

Title :

Mitigation of fail-stop failures in integer matrix products via numerical packing

Author :

Anarado, Ijeoma ; Andreopoulos, Yiannis

Author_Institution :

Electron. & Electr. Eng. Dept., Univ. Coll. London, London, UK

fYear :

2015

fDate :

6-8 July 2015

Firstpage :

101

Lastpage :

107

Abstract :

The decreasing mean-time-to-failure estimates of distributed computing systems indicate that high-performance generic matrix multiply (GEMM) routines running in such environments may need to mitigate an increasing number of fail-stop failures. We propose a new roll-forward solution to this problem that is based on the production of redundant results within the numerical representation of the outputs via the use of numerical packing. This differs from all existing roll-forward solutions, which require a separate set of checksum (or duplicate) results. In particular, unlike all existing approaches, the proposed approach does not require additional hardware resources for failure mitigation. Instead, in our proposal the required duplication is inserted in the input matrices themselves. The accommodation of the duplicated inputs imposes a 30.6% or 37.5% reduction in the maximum output bitwidth supported in comparison to integer matrix products performed on 32-bit floating-point or integer representations, respectively. Nevertheless, this bitwidth reduction is comparable to the one imposed by the checksum elements of traditional roll-forward methods, especially for cases where multiple core failures must be mitigated. Experiments performed on an Amazon EC2 instance with 6 Intel Haswell cores dedicated to GEMM computations show that, in comparison to the state-of-the-art failure-intolerant integer GEMM realization, the proposed approach incurs only a 5-19.4% drop in the achievable peak performance. This overhead is significantly lower than the 33.3-37% overhead incurred by the equivalent checksum-based method.
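The general idea of carrying a duplicate result inside the numerical representation of an output can be illustrated with a minimal sketch. This is not the authors' scheme: the bit offset `K`, the helper names `matmul`, `pack` and `unpack`, the restriction to non-negative integers, and the two-core failure scenario are all assumptions made for illustration only. The sketch relies on the fact that a matrix product is linear, so a block placed `K` bits above another block in the input yields its own product `K` bits above the other product in the output, provided each product entry stays below `2**K`.

```python
# Illustrative sketch only (assumed names and parameters, not the paper's method).
K = 20  # assumed number of bits reserved for each packed result


def matmul(a, b):
    """Plain integer matrix product on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def pack(primary, duplicate, k=K):
    """Place `duplicate` k bits above `primary`, element by element."""
    return [[p + (d << k) for p, d in zip(p_row, d_row)]
            for p_row, d_row in zip(primary, duplicate)]


def unpack(packed, k=K):
    """Split each packed value into its low k bits and the bits above them."""
    mask = (1 << k) - 1
    low = [[v & mask for v in row] for row in packed]
    high = [[v >> k for v in row] for row in packed]
    return low, high


# Two row blocks of the left operand, assumed to be assigned to two cores.
a1 = [[1, 2], [3, 4]]
a2 = [[5, 6], [7, 8]]
b = [[9, 10], [11, 12]]

# Core 0 computes its own block with a duplicate of core 1's block packed in.
core0 = matmul(pack(a1, a2), b)
# Core 1 would compute matmul(pack(a2, a1), b); here it suffers a fail-stop failure.
core1 = None

# Roll forward: both output blocks are recovered from the surviving core alone.
c1, c2 = unpack(core0)
```

The packed product is only valid while every entry of the low result stays below `2**K`, which is one way to see why packing duplicates into the inputs reduces the usable output bitwidth.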

Keywords :

mathematics computing; matrix algebra; parallel processing; Amazon EC2 instance; GEMM routine; Intel Haswell core; bitwidth reduction; distributed computing systems; fail-stop failure mitigation; failure-intolerant integer GEMM realization; floating-point representation; generic matrix multiply routine; integer matrix products; integer representation; mean-time-to-failure estimation; numerical packing; roll-forward solution; Approximation methods; Covariance matrices; Distributed computing; Dynamic range; Kernel; Proposals; Zirconium; distributed computing; fail-stop failures; integer matrix products; sum-of-products;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

On-Line Testing Symposium (IOLTS), 2015 IEEE 21st International

Conference_Location :

Halkidiki

Type :

conf

DOI :

10.1109/IOLTS.2015.7229840

Filename :

7229840

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=734988