• DocumentCode
    580090
  • Title

    Supporting fault-tolerance for time-critical events in distributed environments

  • Author

    Zhu, Qian; Agrawal, Gagan

  • Author_Institution
    Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH, USA
  • fYear
    2009
  • fDate
    14-20 Nov. 2009
  • Firstpage
    1
  • Lastpage
    12
  • Abstract
    In this paper, we consider the problem of supporting fault tolerance for adaptive and time-critical applications in heterogeneous and unreliable grid computing environments. Our goal for this class of applications is to optimize a user-specified benefit function while meeting the time deadline. Our first contribution in this paper is a multi-objective optimization algorithm for scheduling the application onto the most efficient and reliable resources. In this way, the processing can achieve the maximum benefit while also maximizing the success-rate, which is the probability of finishing execution without failures. However, for the cases where failures do occur, we have developed a hybrid failure-recovery scheme to ensure that the application can complete within the pre-specified time interval. Our experimental results show that our scheduling algorithm can achieve better benefit when compared to several heuristics-based greedy scheduling algorithms, while still having a negligible overhead. Benefit is further improved when we apply the hybrid failure recovery scheme, and the success-rate becomes 100%.
  • Keywords
    fault tolerant computing; greedy algorithms; grid computing; optimisation; probability; scheduling; system recovery; distributed environments; fault-tolerance; heterogeneous grid computing; heuristics-based greedy scheduling algorithms; hybrid failure-recovery scheme; multiobjective optimization algorithm; time deadline; time-critical events; unreliable grid computing environments; user-specified benefit function
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    High Performance Computing Networking, Storage and Analysis, Proceedings of the Conference on
  • Conference_Location
    Portland, OR
  • Type

    conf

  • DOI
    10.1145/1654059.1654092
  • Filename
    6375538