DocumentCode :
3469905
Title :
A resource manager for optimal resource selection and fault tolerance service in Grids
Author :
Lee, Hwa Min ; Chin, Sung Ho ; Lee, Jong Hyuk ; Lee, Dae Won ; Chung, Kwang Sik ; Jung, Soon Young ; Yu, Heon Chang
Author_Institution :
Dept. of Comput. Sci. Educ., Korea Univ., Seoul, South Korea
fYear :
2004
fDate :
19-22 April 2004
Firstpage :
572
Lastpage :
579
Abstract :
In this paper, we address the issues of resource management and fault tolerance in Grids. In Grids, the state of the resources selected for job execution is a primary factor in determining computing performance. Specifically, we propose a resource manager for optimal resource selection. The resource manager automatically selects the optimal resources from among the candidate resources using a genetic algorithm. Typically, the probability of failure is higher in Grid computing than in traditional parallel computing, and the failure of resources can be fatal to job execution. A fault tolerance service is therefore essential in computational Grids, and Grid services are often expected to meet minimum levels of quality of service (QoS) for desirable operation. To address this issue, we also propose a fault tolerance service that satisfies QoS requirements. We extend the definition of failures, such as process failure, processor failure, and network failure, and design a fault detector and a fault manager. The simulation results indicate that our approaches are promising in that (1) our resource manager finds the optimal set of resources that guarantees the optimal performance; (2) the fault detector detects the occurrence of resource failures; and (3) the fault manager guarantees that the submitted jobs complete and, through job migration, improves job execution performance even when some failures occur.
Keywords :
fault tolerant computing; genetic algorithms; grid computing; performance evaluation; probability; processor scheduling; quality of service; resource allocation; Grid computing; QoS; computing performance; failure probability; fault detector; fault manager; fault tolerance service; genetic algorithm; job execution; job migration; network failure; optimal performance; optimal resource selection; process failure; processor failure; quality of service; resource manager; Computer networks; Distributed computing; Fault detection; Fault tolerance; Genetic algorithms; Grid computing; Parallel processing; Processor scheduling; Quality of service; Resource management;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Cluster Computing and the Grid, 2004. CCGrid 2004. IEEE International Symposium on
Print_ISBN :
0-7803-8430-X
Type :
conf
DOI :
10.1109/CCGrid.2004.1336659
Filename :
1336659