DocumentCode :
1175206
Title :
Probabilistic Reservation Services for Large-Scale Batch-Scheduled Systems
Author :
Nurmi, Daniel ; Wolski, Rich ; Brevik, John
Author_Institution :
Comput. Sci. Dept., Univ. of California Santa Barbara, Santa Barbara, CA
Volume :
3
Issue :
1
fYear :
2009
fDate :
3/1/2009 12:00:00 AM
Firstpage :
6
Lastpage :
24
Abstract :
In high-performance computing (HPC) settings, in which multiprocessor machines are shared among users with potentially competing resource demands, processors are allocated to user workload using space sharing. Typically, users interact with a given machine by submitting their jobs to a centralized batch scheduler that implements a site-specific, and often partially hidden, policy designed to maximize machine utilization while providing tolerable turnaround times. In practice, while most HPC systems experience good utilization levels, the amount of time experienced by individual jobs waiting to begin execution has been shown to be highly variable and difficult to predict, leading to user confusion and/or frustration. One method for dealing with this uncertainty that has been proposed is the ability to predict the amount of time that individual jobs will wait in batch queues once they are submitted, thus allowing a user to reason about the total time between job submission and job completion (which we term a job's "overall turnaround time"). Another related but distinct method for handling the uncertainty is to allow users who are willing to plan ahead to make "advanced reservations" for processor resources, again allowing them to reason about job turnaround time. To date, however, few if any HPC centers provide either job-queue delay prediction services or advanced reservation capabilities to their general user populations. In this paper, we describe QBETS, VARQ, and CO-VARQ, new methods for allowing users to reason and control the overall turnaround time of their batch-queue jobs submitted to busy HPC systems in existence today. QBETS is an online, non-parametric system for predicting statistical bounds on the amount of time individual batch jobs will wait in queue.
VARQ is a new method for job scheduling that provides users with probabilistic "virtual" advanced reservations using only existing best effort batch schedulers and policies, and CO-VARQ utilizes this capability to implement a general coallocation service. QBETS, VARQ and CO-VARQ operate as overlays, requiring no modification to the local scheduler implementation or policies. We describe the statistical methods we use to implement the systems, detail empirical evaluations of their effectiveness in a number of HPC settings, and explore the potential future impact of these systems should they become widely used.
Keywords :
Web services; processor scheduling; uncertainty handling; advanced reservations; batch queues; centralized batch scheduler; high-performance computing; job scheduling; job-queue delay prediction services; large-scale batch-scheduled systems; multiprocessor machines; probabilistic reservation services; processor resources; space sharing; Computer science; Control systems; Delay; Job design; Large-scale systems; Processor scheduling; Production; Resource management; Statistical analysis; Uncertainty;
fLanguage :
English
Journal_Title :
Systems Journal, IEEE
Publisher :
IEEE
ISSN :
1932-8184
Type :
jour
DOI :
10.1109/JSYST.2008.2011303
Filename :
4787228