DocumentCode :
2552813
Title :
Reliability-aware resource management for computational grid/cluster environments
Author :
Limaye, Kshitij ; Leangsuksun, Box ; Liu, Yudan ; Greenwood, Zeno ; Scott, Stephen L. ; Libby, Richard ; Chanchio, Kasidit
Author_Institution :
Louisiana Tech Univ., Ruston, LA, USA
fYear :
2005
fDate :
13-14 Nov. 2005
Abstract :
The collective resource utilization achieved through grid computing is critical to the overall computing capacity of the collaborative community and should be guaranteed. In particular, in existing environments where job sites are Beowulf cluster systems, a service node failure may cause an outage of the whole system. Current grid fault tolerance techniques address these issues only in an opportunistic fashion. Thus, there is a need to complement these approaches by proactively handling failures at the job-site level, ensuring high system availability with no loss of user-submitted jobs. Our grid-aware cluster resource management effort was motivated by the fact that clusters have become popular job sites in the computational grid environment. We propose a solution that deals with fault tolerance at the service level, complementing the task-based solutions of some recent studies. We discuss various service availability issues related to the grid, along with preliminary results obtained while implementing the smart failover and transparent job-queue replication mechanism and the automated grid installation package. We report that, after implementing our proof-of-concept framework, the benefits outweigh the acceptable overhead.
Keywords :
fault tolerant computing; grid computing; groupware; resource allocation; workstation clusters; Beowulf cluster systems; collaborative community; collective resource utilization; computational grid environments; failure handling; grid computing; grid fault tolerance; grid-aware cluster resource management; reliability-aware resource management; service availability; service node failure; smart failover; transparent job-queue replication; Availability; Collaboration; Distributed computing; Fault tolerance; Grid computing; Laboratories; Packaging; Physics; Problem-solving; Resource management;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Grid Computing, 2005. The 6th IEEE/ACM International Workshop on
Print_ISBN :
0-7803-9492-5
Type :
conf
DOI :
10.1109/GRID.2005.1542744
Filename :
1542744