• DocumentCode
    1236801
  • Title

    Adaptive Task Checkpointing and Replication: Toward Efficient Fault-Tolerant Grids

  • Author

    Chtepen, Maria ; Claeys, Filip H A ; Dhoedt, Bart ; De Turck, Filip ; Demeester, Piet ; Vanrolleghem, Peter A.

  • Author_Institution
    Dept. of Inf. Technol., Ghent Univ., Gent
  • Volume
    20
  • Issue
    2
  • fYear
    2009
  • Firstpage
    180
  • Lastpage
    190
  • Abstract
    A grid is a distributed computational and storage environment often composed of heterogeneous, autonomously managed subsystems. As a result, varying resource availability becomes commonplace, often resulting in loss and delay of executing jobs. To ensure good grid performance, fault tolerance should be taken into account. Commonly utilized techniques for providing fault tolerance in distributed systems are periodic job checkpointing and replication. While very robust, both techniques can delay job execution if inappropriate checkpointing intervals and replica numbers are chosen. This paper introduces several heuristics that dynamically adapt the above-mentioned parameters based on information on grid status to provide high job throughput in the presence of failure while reducing the system overhead. Furthermore, a novel fault-tolerant algorithm combining checkpointing and replication is presented. The proposed methods are evaluated in a newly developed grid simulation environment, Dynamic Scheduling in Distributed Environments (DSiDE), which allows for easy modeling of dynamic system and job behavior. Simulations are run employing workload and system parameters derived from logs that were collected from several large-scale parallel production systems. Experiments have shown that adaptive approaches can considerably improve system performance, while the preference for one of the solutions depends on particular system characteristics, such as load, job submission patterns, and failure frequency.
  • Keywords
    checkpointing; fault tolerant computing; grid computing; open systems; performance evaluation; resource allocation; adaptive task checkpointing; adaptive task replication; delay job execution; distributed storage environment; fault tolerance; fault-tolerant grid; heterogeneous autonomous managed subsystem; resource availability; system performance; distributed systems; performance of systems; availability
  • fLanguage
    English
  • Journal_Title
    Parallel and Distributed Systems, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1045-9219
  • Type

    jour

  • DOI
    10.1109/TPDS.2008.93
  • Filename
    4531738