Regional Information Center for Science and Technology (RICeST) - Loop transformations for fault detection in regular loops on massively parallel systems

DocumentCode :

1281903

Title :

Loop transformations for fault detection in regular loops on massively parallel systems

Author :

Gong, Chun ; Melhem, Rami ; Gupta, Rajiv

Author_Institution :

Massachusetts Language Lab., Hewlett-Packard Co., Chelmsford, MA, USA

Volume :

Issue :

fYear :

1996

fDate :

12/1/1996

Firstpage :

1238

Lastpage :

1249

Abstract :

Distributed-memory systems can incorporate thousands of processors at a reasonable cost. However, as the number of processors in a system increases, fault detection and fault tolerance become critical issues. Errors can be detected by replicating a computation on more than one processor and comparing the results produced by these processors. During the execution of a program, data dependencies typically prevent all of the processors in a multiprocessor system from being busy at all times. Processor schedules therefore contain idle time slots, and the goal of this work is to exploit these idle time slots to schedule duplicated computation for the purpose of fault detection. We propose a compiler-assisted approach to fault detection in regular loops on distributed-memory systems. This approach achieves fault detection by duplicating the execution of statement instances. After carefully analyzing the data dependencies of a regular loop, selected instances of loop statements are duplicated in a way that ensures the desired fault coverage. We first present duplication strategies for fault detection and show that these strategies use idle processor times for executing replicated statements whenever possible. Next, we present loop transformations to implement these fault-detection strategies. A general framework for selecting appropriate loop transformations is also developed. Experiments performed on the CRAY-T3D show that the overhead of adding the fault detection capability is usually less than 25%, and is less than 10% when communication overhead is reduced by grouping messages.
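The two ideas in the abstract, idle slots created by data dependencies and detection by comparing duplicated statement instances, can be illustrated with a minimal Python sketch. It is not the authors' algorithm or their loop transformations. The wavefront mapping (row i on processor i, instance (i, j) at time step i + j), the loop body, and all function names are assumptions chosen for illustration only.

```python
from typing import Callable, Dict, List, Tuple

Instance = Tuple[int, int]  # loop iteration (i, j)
Slot = Tuple[int, int]      # (processor, time step)


def wavefront_schedule(n_rows: int, n_cols: int) -> Dict[Instance, Slot]:
    """Assumed mapping for a loop nest where (i, j) depends on (i-1, j)
    and (i, j-1): row i runs on processor i, instance (i, j) at step i + j."""
    return {(i, j): (i, i + j) for i in range(n_rows) for j in range(n_cols)}


def idle_slots(n_rows: int, n_cols: int) -> List[Slot]:
    """List the (processor, time) slots left empty by the schedule above.
    These are the slots a duplication strategy could try to reuse."""
    busy = set(wavefront_schedule(n_rows, n_cols).values())
    total_steps = n_rows + n_cols - 1
    return [
        (p, t)
        for p in range(n_rows)
        for t in range(total_steps)
        if (p, t) not in busy
    ]


def checked(primary: Callable[..., int], replica: Callable[..., int], *args: int) -> int:
    """Run one statement instance twice and compare the two results.
    In a real system the replica would run on a different processor."""
    first = primary(*args)
    second = replica(*args)
    if first != second:
        raise RuntimeError(f"fault detected for arguments {args}: {first} != {second}")
    return first


def body(north: int, west: int) -> int:
    """Hypothetical loop body combining the two values it depends on."""
    return north + west + 1


if __name__ == "__main__":
    print(idle_slots(3, 4))           # idle slots for a 3 x 4 iteration space
    print(checked(body, body, 2, 3))  # fault-free duplicate agrees with primary
```

With 3 processors and 4 iterations per row, the schedule spans 6 time steps and leaves 6 of the 18 slots idle. Whether a duplicate can legally be placed in a given idle slot depends on the loop's data dependencies and on communication cost, which is the part the paper's analysis addresses and this sketch does not.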

Keywords :

distributed memory systems; fault tolerant computing; CRAY-T3D; communication overhead; compiler-assisted approach; data dependencies; fault detection; fault tolerance; loop transformations; massively parallel systems; processor schedules; regular loops; statement instances; Costs; Fault tolerant systems; Hardware; Multiprocessing systems; Postal services; Program processors; Redundancy; Scalability; VLIW

fLanguage :

English

Journal_Title :

Parallel and Distributed Systems, IEEE Transactions on

Publisher :

IEEE

ISSN :

1045-9219

Type :

jour

DOI :

10.1109/71.553273

Filename :

553273

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1281903