Title :
Replicating statement execution for fault detection on distributed memory multiprocessors
Author :
Gong, Chun ; Melhem, Rami ; Gupta, Rajiv
Author_Institution :
Dept. of Comput. Sci., Pittsburgh Univ., PA, USA
Abstract :
A compiler-assisted methodology is proposed for fault detection on distributed-memory systems. Selected instances of program statements are replicated in a way that ensures appropriate coverage. Replication strategies for the detection of permanent and transient faults are presented. These strategies use idle processor times for replicating statement execution whenever possible. Two approaches for implementing the proposed strategies on single-program multiple-data (SPMD) parallel execution platforms are also discussed. The first approach replicates program statements through source-to-source program transformations, while the second achieves the replication of program statements indirectly by replicating data on multiple processors.
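The general idea of detecting faults by executing a statement instance twice and comparing the results can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' replication strategies: the cyclic owner-computes distribution, the fault model, the strategy names `"permanent"` and `"transient"`, and the helpers `placement` and `run_replicated` are all hypothetical. It shows only the distinction the abstract draws: a permanent fault needs the checking copy on a different processor (spatial replication), while a transient fault can be caught by re-executing on the same processor (temporal replication). Idle-time scheduling and the data-replication approach are not modelled.

```python
from typing import Callable, List, Tuple

# A statement instance is identified by its loop index i and returns a value.
Stmt = Callable[[int], float]
# Hypothetical fault model: (processor, iteration, copy) -> is this execution corrupted?
FaultModel = Callable[[int, int, int], bool]


def no_fault(proc: int, i: int, copy: int) -> bool:
    """Fault-free machine: no execution is ever corrupted."""
    return False


def execute(stmt: Stmt, i: int, proc: int, copy: int, fault: FaultModel) -> float:
    """Simulate running statement instance i on processor proc."""
    value = stmt(i)
    # A corrupted execution is simulated by perturbing its result.
    return value + 1.0 if fault(proc, i, copy) else value


def placement(i: int, num_procs: int, strategy: str) -> Tuple[int, int]:
    """Return (primary, checker) processors for statement instance i."""
    primary = i % num_procs  # assumed cyclic owner-computes distribution
    if strategy == "permanent":
        # Spatial replication: the checking copy must run on a different
        # processor, otherwise a permanent fault corrupts both copies alike.
        if num_procs < 2:
            raise ValueError("permanent-fault detection needs at least 2 processors")
        return primary, (primary + 1) % num_procs
    if strategy == "transient":
        # Temporal replication: re-execute on the same processor later;
        # a transient fault is assumed not to strike both executions.
        return primary, primary
    raise ValueError(f"unknown strategy: {strategy!r}")


def run_replicated(
    n: int,
    num_procs: int,
    stmt: Stmt,
    strategy: str,
    fault: FaultModel = no_fault,
) -> Tuple[List[float], List[int]]:
    """Execute n statement instances twice each and compare the two results.

    Returns the primary results and the indices where the copies disagreed.
    """
    results: List[float] = []
    detected: List[int] = []
    for i in range(n):
        primary, checker = placement(i, num_procs, strategy)
        first = execute(stmt, i, primary, 0, fault)
        second = execute(stmt, i, checker, 1, fault)
        if first != second:
            detected.append(i)  # mismatch => a fault has been detected
        results.append(first)
    return results, detected
```

With four processors and a permanent fault on processor 2 (`lambda proc, i, copy: proc == 2`), the `"permanent"` strategy flags instances 1, 2, 5 and 6, whose primary or checking copy lands on processor 2. The `"transient"` strategy misses the same fault, because both copies run on the faulty processor and are corrupted identically; conversely, it does catch a one-off fault that affects only the first execution of an instance.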
Keywords :
distributed memory systems; fault tolerant computing; compiler-assisted methodology; distributed memory multiprocessors; distributed-memory systems; fault detection; idle processor times; permanent faults; program statements; source-to-source program transformations; statement execution replication; transient faults; Computer science; Costs; Fault detection; Hardware; Multiprocessing systems; Processor scheduling; Program processors; Random access memory; Redundancy; VLIW;
Conference_Title :
Proceedings of IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems, 1994
Conference_Location :
College Station, TX
Print_ISBN :
0-8186-6807-5
DOI :
10.1109/FTPDS.1994.494484