Title :
Scalable and robust key group size estimation for reducer load balancing in MapReduce
Author :
Wei Yan ; Yuan Xue ; Malin, Bradley
Author_Institution :
Dept. of Electr. Eng. & Comput. Sci., Vanderbilt Univ., Nashville, TN, USA
Abstract :
Modern parallel computing systems, such as MapReduce, often assume data values are uniformly distributed. However, in the real world, data is often highly skewed, which may cause workload imbalance among parallel running tasks. In this paper, we study the reduce-phase skew problem in MapReduce, where reduce tasks are often assgined imbalance load (in terms of key groups). We introduce a sketch-based data structure for capturing MapReduce key group size statistics and present an optimal packing algorithm which assigns the key groups to the reducers in a load balancing manner. We perform an empirical evaluation with several real and synthetic datasets over two distinct types of applications. The results show that our load balancing algorithm can strongly mitigate the reduce-phase skew. It can decrease the overall job completion time by 45.5% of the default settings in Hadoop and by 38.3% in comparison to the state-of-the-art solution.
Keywords :
data structures; parallel programming; resource allocation; statistical analysis; Hadoop; MapReduce; data values; empirical evaluation; job completion time; key group size estimation; key group size statistics; load balancing reduction; optimal packing algorithm; parallel computing systems; parallel running tasks; reduce-phase skew problem; sketch-based data structure; workload imbalance; Algorithm design and analysis; Data structures; Estimation; High definition video; Load management; Partitioning algorithms; Radiation detectors; Estimation; Key Group Size; Load Skew; MapReduce;
Conference_Titel :
Big Data, 2013 IEEE International Conference on
Conference_Location :
Silicon Valley, CA
DOI :
10.1109/BigData.2013.6691568