Regional Information Center for Science and Technology (RICEST) - An algorithmic approach to error localization and partial recomputation for low-overhead fault tolerance

DocumentCode :

628214

Title :

An algorithmic approach to error localization and partial recomputation for low-overhead fault tolerance

Author :

Sloan, Jeff ; Kumar, Ravindra ; Bronevetsky, Greg

Author_Institution :

Univ. of Illinois, Urbana-Champaign, Urbana, IL, USA

fYear :

2013

fDate :

24-27 June 2013

Firstpage :

Lastpage :

Abstract :

The increasing size and complexity of massively parallel systems (e.g., HPC systems) are making it increasingly likely that individual circuits will produce erroneous results. For this reason, novel fault tolerance approaches are needed. Prior fault tolerance approaches often rely on checkpoint-rollback schemes. Unfortunately, such schemes are largely limited to scenarios where errors are rare, because their overheads become prohibitive when faults are common. In this paper, we propose a novel approach for algorithmic correction of faulty application outputs. The key insight is that, even in high-error scenarios where the result of an algorithm is erroneous, most of that result is still correct. Instead of simply rolling back to the most recent checkpoint and repeating the entire segment of computation, our resilience approach uses algorithmic error localization and partial recomputation to efficiently correct the corrupted results. We evaluate our approach on linear algebra operations, focusing on matrix-vector multiplication (MVM) and iterative linear solvers. We develop a novel technique for localizing errors in MVM and show how to achieve partial recomputation within this algorithm. We demonstrate that this approach improves the performance of the Conjugate Gradient solver in high-error scenarios by 3x-4x and increases the probability that it completes successfully by up to 60%, in parallel experiments on up to 100 nodes.
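
The abstract does not spell out how errors are localized, so the following is only a minimal sketch of the general idea (localize, then recompute part of the output), not the authors' technique. It assumes a classic block-checksum scheme for a dense MVM: one checksum row is precomputed per block of matrix rows, a mismatch between the checksum and the block's output flags that block, and only flagged blocks are recomputed. The function names, the `block_size` parameter, and the tolerance `tol` are all hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def block_checksums(A, block_size):
    """Precompute one checksum row (column-wise sum) per block of rows of A."""
    n_cols = len(A[0])
    sums = []
    for start in range(0, len(A), block_size):
        block = A[start:start + block_size]
        sums.append([sum(row[j] for row in block) for j in range(n_cols)])
    return sums


def locate_faulty_blocks(y, x, checksums, block_size, tol=1e-9):
    """Return indices of blocks whose output sum disagrees with checksum . x."""
    faulty = []
    for b, c in enumerate(checksums):
        start = b * block_size
        observed = sum(y[start:start + block_size])
        expected = dot(c, x)
        # Relative tolerance (an assumption) to absorb floating-point rounding.
        if abs(observed - expected) > tol * max(1.0, abs(expected)):
            faulty.append(b)
    return faulty


def recompute_blocks(A, x, y, faulty, block_size):
    """Recompute only the rows of y that belong to the flagged blocks."""
    for b in faulty:
        start = b * block_size
        for i in range(start, min(start + block_size, len(A))):
            y[i] = dot(A[i], x)
    return y


A = [[2.0, 0.0, 1.0, 0.0],
     [0.0, 3.0, 0.0, 1.0],
     [1.0, 0.0, 4.0, 0.0],
     [0.0, 1.0, 0.0, 5.0]]
x = [1.0, 2.0, 3.0, 4.0]

checksums = block_checksums(A, block_size=2)
y = [dot(row, x) for row in A]   # fault-free product
y[2] += 7.0                      # simulate a corrupted output entry

faulty = locate_faulty_blocks(y, x, checksums, block_size=2)
y = recompute_blocks(A, x, y, faulty, block_size=2)
```

In this toy run only the second block of two rows is recomputed, rather than the whole product. A single checksum per block can miss errors that cancel within the block, and the checksums themselves are assumed to be fault-free; a real scheme would need to address both.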

Keywords :

checkpointing; computational complexity; conjugate gradient methods; mathematics computing; matrix multiplication; parallel processing; probability; software fault tolerance; vectors; MVM; algorithmic correction; checkpoint rollback-based scheme; conjugate gradient solver; error localization; faulty application outputs; iterative linear solvers; linear algebra operations; low-overhead fault tolerance; massively parallel system complexity; massively parallel system size; matrix-vector multiplication; partial recomputation; performance improvement; Circuit faults; Context; Error analysis; Fault tolerance; Fault tolerant systems; Sparse matrices; algorithmic error correction; numerical methods; sparse linear algebra

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Dependable Systems and Networks (DSN), 2013 43rd Annual IEEE/IFIP International Conference on

Conference_Location :

Budapest

ISSN :

1530-0889

Print_ISBN :

978-1-4673-6471-3

Type :

conf

DOI :

10.1109/DSN.2013.6575309

Filename :

6575309

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=628214