Regional Information Center for Science and Technology - A failure-aware scheduling strategy in large-scale cluster system

DocumentCode :

1932563

Title :

A failure-aware scheduling strategy in large-scale cluster system

Author :

Linping, Wu ; Dan, Meng ; Zhan, Jianfeng ; Lei, Wang ; Bibo, Tu

Author_Institution :

Inst. of Comput. Technol., Chinese Acad. of Sci., Beijing, China

Volume :

fYear :

2006

fDate :

16-19 May 2006

Lastpage :

648

Abstract :

As system scale expands, node failure becomes commonplace in large-scale cluster systems. As an important part of cluster operating system software, the job scheduler is responsible for efficient resource management and reasonable job scheduling. Job scheduling in a cluster is divided into two sub-parts: job selection and node allocation. In this paper, we introduce a failure-aware scheduling strategy, a node allocation policy named LUNF (Longest Uptime Node First), which uses a characterization of node failures. Simulation results show that the LUNF policy yields better system performance than a random node allocation policy.
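
The node allocation step named in the abstract can be illustrated with a minimal sketch. The abstract gives only the policy name (Longest Uptime Node First) and the random baseline, so everything else here is an assumption made for illustration: the `Node` class, the `uptime_hours` field, the function names, and the choice to return `None` when too few idle nodes are available. It is not the authors' implementation.

```python
import random
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical node record; field names are assumptions, not from the paper.
    name: str
    uptime_hours: float  # time since the node last came back up
    idle: bool = True


def lunf_allocate(nodes, count):
    """Pick `count` idle nodes, preferring the longest current uptime."""
    idle = [n for n in nodes if n.idle]
    if len(idle) < count:
        return None  # assumed behavior: the job waits for more free nodes
    idle.sort(key=lambda n: n.uptime_hours, reverse=True)
    return idle[:count]


def random_allocate(nodes, count, rng=random):
    """Baseline: pick `count` idle nodes uniformly at random."""
    idle = [n for n in nodes if n.idle]
    if len(idle) < count:
        return None
    return rng.sample(idle, count)


nodes = [
    Node("n1", 12.0),
    Node("n2", 300.5),
    Node("n3", 48.0),
    Node("n4", 720.0, idle=False),  # busy, so never selected
    Node("n5", 150.0),
]

chosen = lunf_allocate(nodes, 2)
print([n.name for n in chosen])  # ['n2', 'n5']
```

Sorting by uptime only helps if nodes that have stayed up longer are less likely to fail soon. The record's keyword list mentions the Weibull distribution, which has this property when its shape parameter is below 1, but the abstract itself does not state which failure model the authors fitted.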

Keywords :

network operating systems; processor scheduling; resource allocation; workstation clusters; Longest Uptime Node First; cluster operating system software; failure-aware scheduling; job scheduling; job selection; large-scale cluster system; node failure characterization; random node allocation policy; resource management; scale expansion; Computers; Error analysis; Failure analysis; Large-scale systems; Operating systems; Processor scheduling; Research and development; Shape; Supercomputers; Weibull distribution;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Cluster Computing and the Grid, 2006. CCGRID 06. Sixth IEEE International Symposium on

Conference_Location :

Singapore

Print_ISBN :

0-7695-2585-7

Type :

conf

DOI :

10.1109/CCGRID.2006.4

Filename :

1630882

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1932563