Title :
Lark: Bringing Network Awareness to High Throughput Computing
Author :
Zhang, Zhe ; Bockelman, Brian ; Carder, Dale W. ; Tannenbaum, Todd
Abstract :
High throughput computing (HTC) systems are widely adopted in scientific discovery and engineering research. They are responsible for scheduling submitted batch jobs to utilize cluster resources. Current systems mostly focus on managing computing resources such as CPU and memory; however, they lack flexible and fine-grained management mechanisms for network resources. This need has become increasingly urgent, as current batch systems may be distributed among dozens of sites around the globe, as in the Open Science Grid. The Lark project was motivated by this need to re-examine how the HTC layer interacts with the network layer. In this paper, we present the system architecture of Lark and its implementation as a plugin for HTCondor, a popular HTC software project. Lark achieves lightweight network virtualization at per-job granularity for HTCondor by utilizing Linux containers and virtual Ethernet devices; this provides each batch job with a unique network address in a private network namespace. We extended HTCondor's description language, ClassAds, so that users can specify networking requirements in the job submission script. HTCondor can then perform matchmaking to ensure that user-specified network requirements and resource-specific policies are fulfilled. We also extended the job agent, condor_starter, so that it can manage and configure the job's network environment. With this building block as the core, we implement bandwidth management functionality at both the host and network levels utilizing software-defined networking (SDN). Our experiments and evaluations show that Lark can effectively manage network resources within the cluster with low overhead. It provides users with better predictability of their job execution and gives administrators more flexibility in network resource consumption policies.
Keywords :
Linux; local area networks; natural sciences computing; parallel processing; scheduling; software defined networking; virtualisation; CPU; ClassAds; HTC layer; HTC software project; HTCondor; Lark project; Linux container; SDN; bandwidth management functionality; cluster resources; computing resources; engineering research; fine-grained management mechanisms; flexible management mechanisms; high throughput computing; job execution; lightweight network virtualization; memory; network address; network awareness; network layer; open science grid; private network namespace; resource consumption policies; resource-specific policies; scientific discovery; software-defined networking; starter; submitted batch jobs scheduling; user-specified network requirements; virtual Ethernet devices; Bandwidth; Bridges; Hardware; IP networks; Kernel; bandwidth management; network-aware scheduling
Conference_Title :
Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM International Symposium on
Conference_Location :
Shenzhen
DOI :
10.1109/CCGrid.2015.116