Regional Information Center for Science and Technology (RICEST) - Job-Site Level Fault Tolerance for Cluster and Grid environments

DocumentCode :

2384278

Title :

Job-Site Level Fault Tolerance for Cluster and Grid environments

Author :

Limaye, Kshitij ; Leangsuksun, Box ; Greenwood, Zeno ; Scott, Stephen L. ; Engelmann, Christian ; Libby, Richard ; Chanchio, Kasidit

Author_Institution :

Louisiana Tech Univ., Ruston, LA

fYear :

2005

fDate :

Sept. 2005

Firstpage :

Lastpage :

Abstract :

In order to adopt high performance clusters and grid computing for mission critical applications, fault tolerance is a necessity. Fault tolerance in distributed systems is commonly achieved with checkpoint-recovery or, in case of a system outage, with job replication on alternative resources. The first approach depends on the system's MTTR, while the latter depends on the availability of alternative sites to run replicas. There is a need to complement these approaches by proactively handling failures at a job-site level, ensuring high system availability with no loss of user-submitted jobs. This paper discusses a novel fault tolerance technique that enables job-site recovery in Beowulf cluster-based grid environments, whereas existing techniques give up on a failed system and seek alternative resources. Our results suggest a sizable aggregate performance improvement from an implementation of our method in Globus-enabled HA-OSCAR. The technique, called "smart failover", provides a transparent and graceful recovery mechanism that saves job states in a local job-manager queue and transfers those states to the backup server periodically and on critical system events. Thus, whenever a failover occurs, the backup server is able to restart the jobs from their last saved state.
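
The recovery mechanism outlined in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the paper's HA-OSCAR or Globus implementation: the class names (Job, PrimaryServer, BackupServer), the tick-based clock, the sync_interval value, and the progress counter are all hypothetical and chosen for illustration. It shows job states held in a local queue, copied to a backup periodically and on a critical event, and resumed from the last saved state after a failover.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Job:
    job_id: str
    state: str = "queued"   # hypothetical states: queued, running, done
    progress: int = 0       # hypothetical unit of completed work


@dataclass
class BackupServer:
    # Last job states received from the primary head node.
    saved: dict = field(default_factory=dict)

    def receive(self, snapshot):
        self.saved = snapshot

    def take_over(self):
        # On failover, return every unfinished job at its last saved state.
        return [job for job in self.saved.values() if job.state != "done"]


@dataclass
class PrimaryServer:
    backup: BackupServer
    sync_interval: int = 3          # ticks between periodic transfers (assumed)
    queue: dict = field(default_factory=dict)
    ticks_since_sync: int = 0

    def submit(self, job):
        self.queue[job.job_id] = job

    def sync(self):
        # Deep copy so later changes on the primary do not alter the backup.
        self.backup.receive(copy.deepcopy(self.queue))
        self.ticks_since_sync = 0

    def tick(self, critical_event=False):
        # Advance every unfinished job by one unit of work.
        for job in self.queue.values():
            if job.state != "done":
                job.state = "running"
                job.progress += 1
        self.ticks_since_sync += 1
        # Transfer states periodically, or immediately on a critical event.
        if critical_event or self.ticks_since_sync >= self.sync_interval:
            self.sync()
```

In this sketch, a job that has run for four ticks with a sync interval of three would be resumed by the backup from the state saved at the third tick, so one tick of work is repeated instead of the whole job. Calling tick(critical_event=True) forces an immediate transfer, which narrows that window when a failure is anticipated.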

Keywords :

checkpointing; fault tolerant computing; grid computing; Beowulf cluster; Globus; HA-OSCAR; checkpoint-recovery; cluster environment; distributed system; fault tolerance; grid computing; job replication; job-site recovery; smart failover; Aggregates; Availability; Collaborative work; Contracts; Distributed computing; Fault tolerance; Fault tolerant systems; Grid computing; Mission critical systems; Scheduling;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Cluster Computing, 2005. IEEE International

Conference_Location :

Burlington, MA

ISSN :

1552-5244

Print_ISBN :

0-7803-9486-0

Electronic_ISBN :

1552-5244

Type :

conf

DOI :

10.1109/CLUSTR.2005.347043

Filename :

4154086

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2384278