Title : 
A hierarchical checkpointing protocol for parallel applications in cluster federations
         
        
            Author : 
Monnet, Sébastien ; Morin, Christine ; Badrinath, Ramamurthy
         
        
            Author_Institution : 
IRISA, Rennes, France
         
        
        
        
        
            Abstract : 
Summary form only given. Code coupling applications can be divided into communicating modules that may be executed on different clusters in a cluster federation. Because a cluster federation comprises a large number of nodes, the probability of a node failure is high. We propose a hierarchical checkpointing protocol that combines a synchronized checkpointing technique inside clusters with a communication-induced technique between clusters. This protocol fits the characteristics of a cluster federation (a large number of nodes, and high-latency, low-bandwidth networking technologies between clusters). A preliminary performance evaluation using a discrete event simulator shows that the protocol is suitable for code coupling applications.
         
        
            Keywords : 
discrete event simulation; parallel processing; performance evaluation; protocols; system recovery; workstation clusters; cluster federations; code coupling applications; discrete event simulator; hierarchical checkpointing protocol; node failure; parallel applications; Application software; Bandwidth; Checkpointing; Delay; ISO standards; Local area networks; Storage area networks
         
        
        
        
            Conference_Title : 
Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International
         
        
            Print_ISBN : 
0-7695-2132-0
         
        
        
            DOI : 
10.1109/IPDPS.2004.1303242