Title :
Compiler assisted fault detection for distributed-memory systems
Author :
Gong, Chun ; Melhem, Rani ; Gupta, Rajiv
Author_Institution :
Dept. of Comput. Sci., Pittsburgh Univ., PA, USA
Abstract :
Distributed-memory systems provide the most promising performance-to-cost ratio for multiprocessor computers due to their scalability. However, the issues of fault detection and fault tolerance are critical in such systems, since the probability of having faulty components increases with the number of processors. We propose a methodology for fault detection through compiler support. More specifically, we augment the single-program multiple-data (SPMD) execution model to duplicate selected data items in such a way that, during execution, whenever a value of a duplicated data item is computed, the owners of that item are tested. The proposed compiler assisted fault detection technique does not require any specialized hardware and allows for a selective choice of redundancy at compile time.
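The duplication idea in the abstract can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the authors' compiler transformation: the cyclic ownership rule, the choice of the next processor as second owner, the helper names `owners` and `run_checked`, and the injected fault (adding 1 to every value produced by one processor) are all hypothetical and chosen for illustration.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple


def owners(index: int, num_procs: int, duplicated: bool) -> List[int]:
    """Hypothetical ownership rule: cyclic primary owner, plus the
    next processor as a second owner when the item is duplicated."""
    primary = index % num_procs
    if duplicated:
        return [primary, (primary + 1) % num_procs]
    return [primary]


def run_checked(
    data: List[int],
    num_procs: int,
    duplicated: Set[int],
    compute: Callable[[int], int],
    faulty_proc: Optional[int] = None,
) -> Tuple[Dict[int, int], List[Tuple[int, List[int]]]]:
    """Simulate every owner computing its items, then compare the copies
    of each duplicated item. Returns the primary owners' results and a
    list of (item index, owners) pairs whose copies disagree."""
    results: Dict[int, int] = {}
    mismatches: List[Tuple[int, List[int]]] = []
    for i, x in enumerate(data):
        procs = owners(i, num_procs, i in duplicated)
        copies: Dict[int, int] = {}
        for p in procs:
            value = compute(x)
            if p == faulty_proc:
                value += 1  # injected fault (assumption, for illustration)
            copies[p] = value
        # The test: all owners of a duplicated item must agree.
        if len(set(copies.values())) > 1:
            mismatches.append((i, procs))
        results[i] = copies[procs[0]]
    return results, mismatches


# Four items on four simulated processors; only items 0 and 2 are duplicated.
results, mismatches = run_checked(
    data=[1, 2, 3, 4],
    num_procs=4,
    duplicated={0, 2},
    compute=lambda x: x * x,
    faulty_proc=1,
)
```

In this run, simulated processor 1 is faulty. Item 0 is owned by processors 0 and 1, so the disagreement is reported and both owners become suspects. Item 1 is owned only by processor 1 and is not duplicated, so its wrong value passes unnoticed. This is the trade-off behind choosing redundancy selectively: coverage is bought only for the items that are duplicated.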
Keywords :
computer debugging; distributed memory systems; fault tolerant computing; program compilers; reliability; software reliability; compile time; compiler assisted fault detection; data item duplication; distributed-memory systems; fault tolerance; multiprocessor computers; performance to cost ratio; probability; redundancy; scalability; single-program multiple-data execution model; specialized hardware; Computer science; Costs; Distributed computing; Fault detection; Fault tolerance; Fault tolerant systems; Hardware; Multiprocessing systems; Redundancy; Testing;
Conference_Title :
Proceedings of the Scalable High-Performance Computing Conference, 1994
Conference_Location :
Knoxville, TN
Print_ISBN :
0-8186-5680-8
DOI :
10.1109/SHPCC.1994.296667