DocumentCode
3501796
Title
Increasing the cluster availability using RADIC
Author
Duarte, Angelo ; Rexachs, Dolores ; Luque, Emilio
Author_Institution
Dept. of Comput. Archit. & Oper. Syst., Univ. Autonoma de Barcelona
fYear
2006
fDate
25-28 Sept. 2006
Firstpage
1
Lastpage
8
Abstract
The redundant array of distributed independent checkpoints (RADIC) is a fault-tolerant architecture based on a fully distributed array of dedicated processes. These processes collaborate to form a fault tolerance controller that transparently manages all fault tolerance activities. The architecture is designed as a software layer between the application and the cluster structure, and it was developed to meet the requirements of scalability, user transparency, and independence from dedicated or stable cluster resources. RADIC requires only the resources already available in the nodes used by the parallel application, and it uses a pessimistic message-log rollback-recovery protocol to operate without any global synchronization. This protocol, together with the independence from central elements, makes RADIC a scalable architecture that works transparently to the user. We tested the functionality and performance of the architecture in a real scenario using a prototype based on the MPI standard (RADICMPI).
Keywords
checkpointing; fault tolerance; fault tolerant computing; message passing; MPI standard; RADIC; cluster availability; distributed array; distributed independent checkpoints; fault tolerance controller; fault tolerant architecture; message-log rollback-recovery protocol; parallel application; redundant array; scalable architecture; software layer; Application software; Collaboration; Computer architecture; Costs; Fault tolerance; Fault tolerant systems; Operating systems; Programming profession; Protocols; Scalability
fLanguage
English
Publisher
IEEE
Conference_Titel
Cluster Computing, 2006 IEEE International Conference on
Conference_Location
Barcelona
ISSN
1552-5244
Print_ISBN
1-4244-0327-8
Electronic_ISBN
1552-5244
Type
conf
DOI
10.1109/CLUSTR.2006.311872
Filename
4100378