DocumentCode :
260480
Title :
φSched: A Heterogeneity-Aware Hadoop Workflow Scheduler
Author :
Krish, K.R. ; Anwar, Ali ; Butt, Ali R.
Author_Institution :
Dept. of Comput. Sci., Virginia Tech, Blacksburg, VA, USA
fYear :
2014
fDate :
9-11 Sept. 2014
Firstpage :
255
Lastpage :
264
Abstract :
Enterprise Hadoop applications now routinely comprise complex workflows that are managed by specialized workflow schedulers such as Oozie. The resources are assumed to be similar or homogeneous, and data locality is often the only scheduling constraint considered. However, the introduction of specialized architectures and regular system upgrades lead to Hadoop data center hardware becoming increasingly heterogeneous, in that a data center may have several clusters, each boasting different characteristics. The workflow scheduler is not aware of such heterogeneity, and thus cannot ensure that a cluster selected based on data locality is also suitable for supporting the jobs efficiently in terms of execution time and resource consumption. In this paper, we adopt a quantitative approach where we first study the detailed behavior of various representative Hadoop applications running on four different hardware configurations. Next, we incorporate this information into a hardware-aware scheduler, φSched, to improve the resource-application match. To ensure that job-associated data is available locally (or nearby) to a cluster in a multi-cluster deployment, we configure a single Hadoop Distributed File System (HDFS) instance across all the participating clusters. We also design and implement region-aware data placement and retrieval for HDFS in order to reduce the network overhead and achieve cluster-level data locality. We evaluate our approach using experiments on Amazon EC2 with four clusters of eight homogeneous nodes each, where each cluster has a different hardware configuration. We find that φSched's optimized placement of applications across the test clusters reduces the execution time of the test applications by 18.7%, on average, when compared to extant hardware-oblivious scheduling. Moreover, our HDFS enhancement increases the I/O throughput by up to 23% and the average I/O rate by up to 26% for the TestDFSIO benchmark.
Keywords :
data handling; parallel processing; scheduling; φSched; Amazon EC2; HDFS; cluster-level data locality; data center; enterprise Hadoop applications; heterogeneity-aware Hadoop workflow scheduler; network overhead; quantitative approach; region-aware data placement; single Hadoop distributed file system; Clustering algorithms; Computer architecture; File systems; Hardware; Prediction algorithms; Schedules; Substrates
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Modelling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), 2014 IEEE 22nd International Symposium on
Conference_Location :
Paris
ISSN :
1526-7539
Type :
conf
DOI :
10.1109/MASCOTS.2014.40
Filename :
7033662