Title : 
Optimal cost-effective design of parallel systems subject to imperfect fault-coverage
         
        
            Author : 
Amari, Suprasad ; McLaughlin, Leland ; Yadlapati, Bhanu
         
        
            Author_Institution : 
Relex Software Corp., Greensburg, USA
         
            Abstract : 
Computer-based systems intended for critical applications are usually designed with sufficient redundancy to tolerate the errors that may occur. However, under imperfect fault-coverage conditions (that is, when the system cannot adequately detect, locate, and recover from faults and errors), system failures can result even when adequate redundancy is in place. Because the parallel architecture is a well-known and powerful means of improving the reliability of fault-tolerant systems, this paper presents cost-effective design policies for parallel systems subject to imperfect fault-coverage. The policies are designed by considering (1) cost of components, (2) failure cost of the system, (3) common-cause failures, and (4) performance levels of the system. Three kinds of cost functions are formulated, in which the total average cost of the system is based on (1) system unreliability, (2) failure-time of the system, and (3) total processor-hours. It is shown that the MTTF (mean time to failure) of the system decreases when the number of spares is increased beyond a certain limit. Therefore, this paper also presents optimal design policies to maximize the MTTF of these systems. The results of this paper can also be applied to gracefully degradable systems.
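The observation that MTTF eventually falls as spares are added can be illustrated with a minimal sketch. It is not the paper's model: it assumes n identical, active-parallel units with a constant failure rate, a single coverage probability applied to every failure except the last unit's, and no common-cause failures or repair. Under those assumptions the MTTF satisfies M(1) = 1/rate and M(k) = 1/(k*rate) + coverage*M(k-1). The names `mttf_parallel` and `best_unit_count` are hypothetical helpers introduced here for illustration.

```python
def mttf_parallel(n, failure_rate, coverage):
    """MTTF of n active-parallel units under the simplified model above.

    Assumption (not from the paper): each unit fails at the constant rate
    `failure_rate`, and every failure except the last unit's is survived
    only with probability `coverage`.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    mttf = 1.0 / failure_rate  # one unit left: its failure ends the system
    for k in range(2, n + 1):
        # With k units up, the first failure arrives after 1/(k*rate) on
        # average; the system carries on only if that failure is covered.
        mttf = 1.0 / (k * failure_rate) + coverage * mttf
    return mttf


def best_unit_count(failure_rate, coverage, max_units):
    """Return the unit count in 1..max_units with the largest MTTF."""
    return max(range(1, max_units + 1),
               key=lambda n: mttf_parallel(n, failure_rate, coverage))


if __name__ == "__main__":
    for n in range(1, 9):
        print(n, round(mttf_parallel(n, 1.0, 0.9), 4))
    print("best n:", best_unit_count(1.0, 0.9, 20))
```

With an assumed failure rate of 1.0 and coverage of 0.9, this simplified model peaks at five units (MTTF about 1.716) and is already lower at six (about 1.711). With perfect coverage (1.0) the recursion reduces to the harmonic sum and the MTTF never decreases.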
         
        
            Keywords : 
failure analysis; fault tolerant computing; parallel architectures; MTTF; common-cause failures; components cost; critical application computer-based systems; error tolerance; failure cost; fault-tolerant systems reliability improvement; gracefully degradable systems; imperfect fault-coverage; mean time to failure; optimal cost-effective design; parallel architecture; parallel systems; performance levels; redundancy; system failure-time; system unreliability; total processor-hours; Application software; Computer errors; Cost function; Degradation; Fault detection; Fault tolerance; Fault tolerant systems; Parallel architectures; Power system reliability; Redundancy;
         
            Conference_Title : 
Reliability and Maintainability Symposium, 2003. Annual
         
        
        
            Print_ISBN : 
0-7803-7717-6
         
        
        
            DOI : 
10.1109/RAMS.2003.1181898