Title :
A Framework for Executing Long Running Jobs in Grid Environments
Author :
Markatchev, Nayden ; Kiddle, Cameron ; Simmonds, Rob
Author_Institution :
Dept. of Comput. Sci., Univ. of Calgary, Calgary, AB
Abstract :
Computational jobs that take days, weeks or months to run usually cannot be executed as a single job due to system failures and scheduling constraints. Instead, the job must be split into a series of shorter jobs. Solutions for managing the execution of such jobs in grid environments must address many issues. Participating systems and their properties can change over time, so dynamic resource discovery mechanisms are important. Data management tools are needed to manage and keep track of data that can be distributed across multiple sites. Fault tolerance is required to handle the many different errors and failures that can occur in such environments. Furthermore, support for job reconfiguration, in terms of the number of processors, run length, and memory required, is necessary to allow jobs to adapt to the heterogeneous resources to which they are submitted. This paper presents a framework for executing long-running jobs in grid environments that addresses the above issues. The framework automates checkpointing, migration and reconfiguration of jobs. It has been successfully tested with the GROMACS molecular dynamics simulation application in a GT4-based grid environment composed of resources distributed across Canada.
Keywords :
fault tolerant computing; formal verification; grid computing; molecular dynamics method; resource allocation; scheduling; GROMACS molecular dynamics simulation; GT4-based grid environment; checkpointing; computational jobs; data management tools; dynamic resource discovery mechanisms; fault tolerance; grid environments; scheduling constraints; system failures; Application software; Checkpointing; Computer science; Environmental management; Fault tolerance; Grid computing; High performance computing; Mechanical factors; Processor scheduling; Testing; Adaptive Scheduling; Execution Framework; Grid Computing
Conference_Title :
High Performance Computing Systems and Applications, 2008. HPCS 2008. 22nd International Symposium on
Conference_Location :
Quebec City, Que.
Print_ISBN :
978-0-7695-3250-9
DOI :
10.1109/HPCS.2008.7