• DocumentCode
    2310868
  • Title

    Adaptive Checkpoint Replication for Supporting the Fault Tolerance of Applications in the Grid

  • Author

    Luckow, André ; Schnor, Bettina

  • Author_Institution
    Inst. of Comput. Sci., Univ. of Potsdam, Potsdam
  • fYear
    2008
  • fDate
    10-12 July 2008
  • Firstpage
    299
  • Lastpage
    306
  • Abstract
    A major challenge in a dynamic Grid with thousands of machines connected to each other is fault tolerance. The more resources and components involved, the more complicated and error-prone becomes the system. Migol is an adaptive Grid middleware, which addresses the fault tolerance of Grid applications and services by providing the capability to recover applications from checkpoint files automatically. A critical aspect for an automatic recovery is the availability of checkpoint files: If a resource becomes unavailable, it is very likely that the associated storage is also unreachable, e.g. due to a network partition. A strategy to increase the availability of checkpoints is replication. In this paper, we present the Checkpoint Replication Service. A key feature of this service is the ability to automatically replicate and monitor checkpoints in the Grid.
  • Keywords
    checkpointing; grid computing; middleware; software fault tolerance; adaptive Grid middleware; adaptive checkpoint replication; checkpoint replication service; fault tolerance; Application software; Availability; Checkpointing; Computer applications; Computer networks; Fault tolerance; Humans; Libraries; Middleware; Resonance light scattering; Checkpointing; Grid Computing; Replication;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Network Computing and Applications, 2008. NCA '08. Seventh IEEE International Symposium on
  • Conference_Location
    Cambridge, MA
  • Print_ISBN
    978-0-7695-3192-2
  • Electronic_ISBN
    978-0-7695-3192-2
  • Type

    conf

  • DOI
    10.1109/NCA.2008.38
  • Filename
    4579677