Title :
Cross-Phase Optimization in MapReduce
Author :
Heintz, B. ; Chenyu Wang ; Chandra, Aniruddha ; Weissman, J.
Author_Institution :
Dept. of Comput. Sci. & Eng., Univ. of Minnesota, Minneapolis, MN, USA
Abstract :
MapReduce has been designed to accommodate large-scale data-intensive workloads running on large single-site homogeneous clusters. Researchers have begun to explore the extent to which the original MapReduce assumptions can be relaxed, including skewed workloads, iterative applications, and heterogeneous computing environments. Our work continues this exploration by applying MapReduce to widely distributed data across distributed computation resources. This problem arises when datasets are generated at multiple sites, as is common in many scientific domains and, increasingly, in e-commerce applications. It also occurs when multi-site resources, such as geographically separated data centers, are applied to the same MapReduce job. Using Hadoop, we show that the absence of network and node homogeneity and of data locality leads to poor performance. The problem is that the interaction between MapReduce phases becomes pronounced in the presence of heterogeneous network behavior. In this paper, we propose new cross-phase optimization techniques that enable independent MapReduce phases to influence one another. We propose techniques that optimize the push and map phases to enable push-map overlap and to allow map behavior to feed back into push dynamics. Similarly, we propose techniques that optimize the map and reduce phases to enable shuffle cost to feed back and affect map scheduling decisions. We evaluate the benefits of our techniques on both Amazon EC2 and PlanetLab. The experimental results show the potential of these techniques: performance improves by 7% to 18%, depending on the execution environment and application.
Keywords :
cloud computing; computer centres; optimisation; pattern clustering; resource allocation; scheduling; Amazon EC2; Hadoop; MapReduce job; PlanetLab; cross-phase optimization; cross-phase optimization techniques; data locality; distributed computation resources; e-commerce applications; heterogeneous computing environments; heterogeneous network behavior; iterative applications; large single-site homogeneous clusters; large-scale data-intensive workloads; map scheduling decisions; multisite resources; network homogeneity; node homogeneity; push dynamics; push-map overlap; scientific domains; Bandwidth; Distributed databases; Europe; Monitoring; Optimization; Processor scheduling; Runtime; Cloud; Distributed; MapReduce; Scheduling;
Conference_Title :
Cloud Engineering (IC2E), 2013 IEEE International Conference on
Conference_Location :
Redwood City, CA
Print_ISBN :
978-1-4673-6473-7
DOI :
10.1109/IC2E.2013.26