Title :
A Dynamic Resource Management System for Network-Attached Accelerator Clusters
Author :
Prabhakaran, Suraj ; Iqbal, M. ; Rinke, Sebastian ; Wolf, Felix
Author_Institution :
German Res. Sch. for Simulation Sci., Lab. for Parallel Program., RWTH Aachen Univ., Aachen, Germany
Abstract :
Over the years, cluster systems have become increasingly heterogeneous as cluster nodes have been equipped with one or more accelerators such as graphics processing units (GPUs). These devices are typically attached to a compute node via PCI Express. As a consequence, batch systems such as TORQUE/Maui and SLURM have been extended to be aware of these additional resources, which are tightly coupled with compute nodes. Recent advances in accelerator technology have made it possible to use network-attached accelerators in addition to node-attached accelerators. However, current batch systems do not support this new usage scenario. This work focuses on batch-system support for allocating network-attached accelerators. The most important feature of the proposed batch system is its ability to dynamically allocate network-attached accelerators to jobs at application runtime. We discuss our extensions to the TORQUE/Maui batch system and elaborate on its features within the Dynamic Accelerator-Cluster Architecture, which describes an integration of network-attached accelerators into a cluster system. We also evaluate the dynamic allocation scenarios and show how batch systems can be designed to support more flexible and dynamic cluster systems.
Keywords :
batch processing (computers); graphics processing units; multiprocessing systems; peripheral interfaces; GPU; PCI Express; SLURM batch system; TORQUE-Maui batch system; cluster nodes; dynamic accelerator-cluster architecture; dynamic cluster systems; dynamic resource management system; graphic processing units; network-attached accelerator cluster system; node-attached accelerators; Computer architecture; Dynamic scheduling; Graphics processing units; Method of moments; Resource management; Servers; Torque; dynamic resource management; dynamic scheduling; heterogeneous architectures;
Conference_Title :
Parallel Processing (ICPP), 2013 42nd International Conference on
Conference_Location :
Lyon
DOI :
10.1109/ICPP.2013.91