φSched: A Heterogeneity-Aware Hadoop Workflow Scheduler

Author

Krish, K.R. ; Anwar, Ali ; Butt, Ali R.

Author_Institution

Dept. of Comput. Sci., Virginia Tech, Blacksburg, VA, USA

fYear

2014

fDate

9-11 Sept. 2014

Firstpage

255

Lastpage

264

Abstract

Enterprise Hadoop applications now routinely comprise complex workflows that are managed by specialized workflow schedulers such as Oozie. Resources are assumed to be similar or homogeneous, and data locality is often the only scheduling constraint considered. However, the introduction of specialized architectures and regular system upgrades is making Hadoop data center hardware increasingly heterogeneous, in that a data center may have several clusters, each with different characteristics. Because the workflow scheduler is not aware of such heterogeneity, it cannot ensure that a cluster selected based on data locality is also suitable for supporting the jobs efficiently in terms of execution time and resource consumption. In this paper, we adopt a quantitative approach where we first study the detailed behavior of various representative Hadoop applications running on four different hardware configurations. Next, we incorporate this information into a hardware-aware scheduler, φSched, to improve the resource-application match. To ensure that job-associated data is available locally (or nearby) to a cluster in a multi-cluster deployment, we configure a single Hadoop Distributed File System (HDFS) instance across all the participating clusters. We also design and implement region-aware data placement and retrieval for HDFS in order to reduce the network overhead and achieve cluster-level data locality. We evaluate our approach using experiments on Amazon EC2 with four clusters of eight homogeneous nodes each, where each cluster has a different hardware configuration. We find that φSched's optimized placement of applications across the test clusters reduces the execution time of the test applications by 18.7%, on average, when compared to extant hardware-oblivious scheduling. Moreover, our HDFS enhancement increases the I/O throughput by up to 23% and the average I/O rate by up to 26% for the TestDFSIO benchmark.

Keywords

data handling; parallel processing; scheduling; φSched; Amazon EC2; HDFS; cluster-level data locality; data center; enterprise Hadoop applications; heterogeneity-aware Hadoop workflow scheduler; network overhead; quantitative approach; region-aware data placement; single Hadoop distributed file system; Clustering algorithms; Computer architecture; File systems; Hardware; Prediction algorithms; Schedules; Substrates

fLanguage

English

Publisher

IEEE

Conference_Titel

Modelling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), 2014 IEEE 22nd International Symposium on

Conference_Location

Paris

ISSN

1526-7539

Type

conf

DOI

10.1109/MASCOTS.2014.40

Filename

7033662