Title : 
Application MTTFE vs. Platform MTBF: A Fresh Perspective on System Reliability and Application Throughput for Computations at Scale
         
        
            Author : 
Daly, J.T. ; Pritchett-Sheats, L.A. ; Michala, S.E.
         
        
            Author_Institution : 
Los Alamos Nat. Lab., Los Alamos, NM
         
            Abstract : 
When running on HPC systems characterized by component failure rates high enough to impact productivity, it becomes important to consider the impact of those failures on individual applications. Typically, this is done by assuming that the mean time between failures (MTBF) for hardware and software components on the system is equivalent to the mean time to fatal error (MTTFE) for an application running on that system. In addition, one commonly applies the rule-of-thumb estimate that application MTTFE scales as the inverse of the number of nodes used to run the application, so that running on half as many nodes increases MTTFE by a factor of two. However, this estimate does not take into account the fact that a non-trivial fraction of failures affect multiple compute nodes, so a single component failure has the potential to cause multiple application fatal errors. In the work that follows, a new model for application MTTFE is derived based on the impact of multi-component failures and their potential to terminate multiple applications.
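
As a rough illustration of the rule-of-thumb estimate mentioned in the abstract, the sketch below computes application MTTFE as a per-node MTBF divided by the number of nodes used. The function name and the 50,000-hour per-node MTBF are hypothetical choices made for this example, not values from the paper. The sketch covers only the baseline estimate that the abstract describes as incomplete; it does not represent the new model the paper derives for multi-node failures.

```python
def rule_of_thumb_mttfe(node_mtbf_hours: float, nodes: int) -> float:
    """Baseline estimate: application MTTFE is inversely proportional to node count.

    Assumes failures are independent and each failure affects exactly one node,
    which is the simplification the abstract says does not hold in practice.
    """
    if nodes <= 0:
        raise ValueError("nodes must be a positive integer")
    return node_mtbf_hours / nodes


# Hypothetical per-node MTBF, chosen only for illustration.
NODE_MTBF_HOURS = 50_000.0

full_job = rule_of_thumb_mttfe(NODE_MTBF_HOURS, 1024)
half_job = rule_of_thumb_mttfe(NODE_MTBF_HOURS, 512)

# Halving the node count doubles the estimated MTTFE under this rule.
print(f"1024 nodes: {full_job:.2f} h, 512 nodes: {half_job:.2f} h")
```

Under this assumption the ratio between the two estimates is exactly two, matching the "half as many nodes" statement in the abstract.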
         
        
            Keywords : 
system recovery; hardware component failure; high performance computing; mean time between failure; mean time to fatal error; software component failure; system reliability; thumb rule estimation; Application software; Computer applications; Costs; Error correction; File systems; Grid computing; Reliability; Runtime; Throughput; Time measurement; MTBF; failure rate; failures; reliability; resilience
         
            Conference_Title : 
Cluster Computing and the Grid, 2008. CCGRID '08. 8th IEEE International Symposium on
         
        
            Conference_Location : 
Lyon
         
        
            Print_ISBN : 
978-0-7695-3156-4
         
        
            Electronic_ISBN : 
978-0-7695-3156-4
         
            DOI : 
10.1109/CCGRID.2008.103