• DocumentCode
    866683
  • Title

    On the Design of Fault-Tolerant Scheduling Strategies Using Primary-Backup Approach for Computational Grids with Low Replication Costs

  • Author

    Zheng, Qin ; Veeravalli, Bharadwaj ; Tham, Chen-Khong

  • Author_Institution
    Inst. of High Performance Comput., Agency for Sci., Singapore
  • Volume
    58
  • Issue
    3
  • fYear
    2009
  • fDate
    3/1/2009 12:00:00 AM
  • Firstpage
    380
  • Lastpage
    393
  • Abstract
    Fault-tolerant scheduling is an imperative step for large-scale computational grid systems, as often geographically distributed nodes co-operate to execute a task. By and large, primary-backup approach is a common methodology used for fault tolerance wherein each task has a primary copy and a backup copy on two different processors. In this paper, we identify two cases that may happen when scheduling dependent tasks with primary-backup approach. We derive two important constraints that must be satisfied. Further, we show that these two constraints play a crucial role in limiting the schedulability and overloading efficiency of backups of dependent tasks. We then propose two strategies to improve schedulability and overloading efficiency, respectively. We propose two algorithms (MRC-ECT and MCT-LRC), to schedule backups of independent jobs and dependent jobs, respectively. MRC-ECT is shown to guarantee an optimal backup schedule in terms of replication cost for an independent task, while MCT-LRC can schedule a backup of a dependent task with minimum completion time and less replication cost. We conduct extensive simulation experiments to quantify the performance of the proposed algorithms.
  • Keywords
    computational complexity; directed graphs; fault tolerant computing; grid computing; processor scheduling; computational complexity; computational grid; directed acyclic graph; fault-tolerant scheduling; primary-backup approach; replication cost; Computer errors; Cost function; Distributed computing; Fault tolerance; Fault tolerant systems; Grid computing; Hardware; Large-scale systems; Middleware; Processor scheduling; Scheduling algorithm; Grid computing; directed acyclic graphs; fault-tolerance; independent tasks; primary-backup;
  • fLanguage
    English
  • Journal_Title
    Computers, IEEE Transactions on
  • Publisher
    ieee
  • ISSN
    0018-9340
  • Type

    jour

  • DOI
    10.1109/TC.2008.172
  • Filename
    4626949