Title :
Increasing the cluster availability using RADIC
Author :
Duarte, Angelo ; Rexachs, Dolores ; Luque, Emilio
Author_Institution :
Dept. of Comput. Archit. & Oper. Syst., Univ. Autonoma de Barcelona
Abstract :
The redundant array of distributed independent checkpoints (RADIC) is a fault-tolerant architecture based on a fully distributed array of dedicated processes. These processes collaborate to form a fault tolerance controller that transparently manages all fault tolerance activities. The architecture is designed as a software layer between the application and the cluster structure, and it was developed to meet the requirements of scalability, user transparency, and independence from dedicated or stable cluster resources. RADIC requires only the resources already available in the nodes used by the parallel application, and it uses a pessimistic message-log rollback-recovery protocol so that it can operate without any global synchronization. This protocol, together with the independence from central elements, makes RADIC a scalable architecture that works transparently to the user. We tested the functionality and performance of the architecture in a real scenario using a prototype based on the MPI standard (RADICMPI).
Keywords :
checkpointing; fault tolerance; fault tolerant computing; message passing; MPI standard; RADIC; cluster availability; distributed array; distributed independent checkpoints; fault tolerance controller; fault tolerant architecture; message-log rollback-recovery protocol; parallel application; redundant array; scalable architecture; software layer; Application software; Collaboration; Computer architecture; Costs; Fault tolerance; Fault tolerant systems; Operating systems; Programming profession; Protocols; Scalability;
Conference_Title :
Cluster Computing, 2006 IEEE International Conference on
Conference_Location :
Barcelona
Print_ISBN :
1-4244-0327-8
Electronic_ISBN :
1552-5244
DOI :
10.1109/CLUSTR.2006.311872