Title :
Adaptive Checkpoint Replication for Supporting the Fault Tolerance of Applications in the Grid
Author :
Luckow, André ; Schnor, Bettina
Author_Institution :
Inst. of Comput. Sci., Univ. of Potsdam, Potsdam
Abstract :
A major challenge in a dynamic Grid with thousands of interconnected machines is fault tolerance. The more resources and components are involved, the more complicated and error-prone the system becomes. Migol is an adaptive Grid middleware that addresses the fault tolerance of Grid applications and services by providing the capability to recover applications from checkpoint files automatically. A critical aspect of automatic recovery is the availability of checkpoint files: if a resource becomes unavailable, it is very likely that the associated storage is also unreachable, e.g., due to a network partition. A strategy to increase the availability of checkpoints is replication. In this paper, we present the Checkpoint Replication Service. A key feature of this service is the ability to automatically replicate and monitor checkpoints in the Grid.
Keywords :
checkpointing; grid computing; middleware; software fault tolerance; adaptive Grid middleware; adaptive checkpoint replication; checkpoint replication service; fault tolerance; Application software; Availability; Checkpointing; Computer applications; Computer networks; Fault tolerance; Humans; Libraries; Middleware; Resonance light scattering; Checkpointing; Grid Computing; Replication;
Conference_Title :
Network Computing and Applications, 2008. NCA '08. Seventh IEEE International Symposium on
Conference_Location :
Cambridge, MA
Print_ISBN :
978-0-7695-3192-2
Electronic_ISBN :
978-0-7695-3192-2
DOI :
10.1109/NCA.2008.38