Title :
Fault tolerance via N-modular software redundancy
Author_Institution :
AT&T Bell Labs., Murray Hill, NJ, USA
Abstract :
Presents a novel method of "indirect" software instrumentation to achieve fault tolerance at the application level. Error detection and recovery are based on the well-known approach of replicating application processes on multiple computers in a network. The advantages of this fault tolerance scheme based on indirect instrumentation include: (1) a general error detection method that ensures data integrity for critical data without any modification of the code, (2) a high degree of automation and transparency in fault-tolerant configuration and operation (i.e., the set-up time for a new application is on the order of a few minutes), and (3) the ability to perform error detection for applications whose source code is unavailable or about which only minimal knowledge is available, including legacy applications. The types of faults tolerated include transient and permanent hardware faults on a single machine and certain types of application and operating system software faults.
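Illustrative sketch (not taken from the paper): the abstract does not say how the replicas' critical data are compared. One common way to detect errors in an N-modular redundant setup is a strict-majority vote over the values each replica reports. The function name `vote` and its return convention are assumptions made for this example only; the paper's indirect instrumentation mechanism is not shown here.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence, Tuple


def vote(replica_values: Sequence[Hashable]) -> Tuple[Optional[Hashable], List[int]]:
    """Strict-majority vote over the values reported by N replicas.

    Hypothetical helper for illustration. Returns a pair
    (agreed_value, indices_of_disagreeing_replicas); agreed_value is
    None when no value is held by more than half of the replicas, in
    which case every replica index is returned as suspect.
    """
    if not replica_values:
        raise ValueError("at least one replica value is required")

    # Find the most frequently reported value and how often it occurs.
    value, count = Counter(replica_values).most_common(1)[0]

    # Require a strict majority; a tie or plurality is not enough.
    if count * 2 <= len(replica_values):
        return None, list(range(len(replica_values)))

    # Replicas whose value differs from the majority are flagged.
    faulty = [i for i, v in enumerate(replica_values) if v != value]
    return value, faulty
```

With three replicas, a single faulty copy is outvoted and identified by its index, which is the point at which a recovery step (for example, restarting that replica) could be triggered. Values must be hashable here; comparing larger data would likely rely on a checksum of each replica's critical data instead, which is again an assumption rather than something the abstract states.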
Keywords :
data integrity; error detection; operating systems (computers); redundancy; software fault tolerance; system recovery; N-modular software redundancy; application process replication; application software faults; application-level fault tolerance; automation; critical data; error recovery; indirect software instrumentation; legacy applications; multiprocessor network; operating system software faults; permanent hardware faults; setup time; source code; transient hardware faults; transparency; Application software; Automation; Computer errors; Computer networks; Fault detection; Fault tolerance; Hardware; Instruments; Operating systems; Redundancy
Conference_Title :
Fault-Tolerant Computing, 1998. Digest of Papers. Twenty-Eighth Annual International Symposium on
Conference_Location :
Munich, Germany
Print_ISBN :
0-8186-8470-4
DOI :
10.1109/FTCS.1998.689471