Title : 
Fault Tolerance in Cluster Federations with O2P-CF
         
        
            Author : 
Ropars, Thomas ; Morin, Christine
         
        
            Author_Institution : 
IRISA/Paris Project-Team, Paris
         
        
        
        
        
            Abstract : 
Fault tolerance is one of the key issues for large-scale applications executed on high-performance computing systems. In a cluster federation, clusters are gathered to provide huge computing power. To work efficiently on such systems, network characteristics have to be taken into account: the latency between two nodes of different clusters is much higher than the latency between two nodes of the same cluster. In this paper, we present O2P-CF, a message logging protocol well suited to providing fault tolerance for message passing applications executed on cluster federations. O2P-CF is based on the combination of O2P, an extremely optimistic message logging protocol, with a pessimistic message logging protocol.
         
        
            Keywords : 
fault tolerant computing; message passing; parallel processing; protocols; workstation clusters; O2P-CF protocol; cluster federations; fault tolerance; high performance computing systems; message passing applications; optimistic message logging protocol; pessimistic message logging protocol; Algorithm design and analysis; Delay; Fault tolerant systems; Grid computing; High performance computing; Large-scale systems; Libraries; Cluster federation; message logging; message passing application
         
        
        
        
            Conference_Title : 
8th IEEE International Symposium on Cluster Computing and the Grid (CCGRID '08), 2008
         
        
            Conference_Location : 
Lyon
         
        
            Print_ISBN : 
978-0-7695-3156-4
         
        
            Electronic_ISBN : 
978-0-7695-3156-4
         
        
        
            DOI : 
10.1109/CCGRID.2008.76