Title :
Transparent fault tolerance middleware at user level
Author :
Castro, Marcela ; Rexachs, Dolores ; Luque, Emilio
Author_Institution :
Comput. Archit. & Oper. Syst. Dept., Univ. Autonoma de Barcelona, Barcelona, Spain
Abstract :
We present the design of a transparent fault tolerance middleware for message-passing applications. The approach transforms the interconnections used by the application into reliable ones and supports a log-based rollback recovery protocol. When a node of the cluster fails, its processes are recovered on a new node and their connections are reestablished. All of this work is done automatically and transparently to the application. The service can optionally be activated at runtime at user level. The models used to protect and recover the application and to detect failures are based on the RADIC architecture. We have tested the middleware by executing master-worker (M/W) and SPMD applications, which follow different communication patterns.
Keywords :
message passing; middleware; software fault tolerance; system recovery; RADIC architecture; SPMD applications; communication patterns; failure detection; log-based rollback recovery protocol; master-worker applications; message passing applications; transparent fault tolerance middleware; user level; Fault tolerance; Fault tolerant systems; Libraries; Observers; Peer to peer computing; Protocols; Sockets; Fault-tolerance; High-Availability; RADIC; parallel computing;
Conference_Title :
High Performance Computing and Simulation (HPCS), 2012 International Conference on
Conference_Location :
Madrid
Print_ISBN :
978-1-4673-2359-8
DOI :
10.1109/HPCSim.2012.6266974