Title :
Improving the Shuffle of Hadoop MapReduce
Author :
Jingui Li ; Xuelian Lin ; Xiaolong Cui ; Yue Ye
Author_Institution :
Sch. of Comput. Sci. & Eng., Beihang Univ., Beijing, China
Abstract :
As an efficient parallel computing system based on the MapReduce model, Hadoop is widely used for large-scale data analysis such as data mining, machine learning, and scientific simulation. However, MapReduce still has performance problems, especially in the shuffle phase. To address them, this paper proposes a lightweight, independent shuffle service component with a more efficient I/O policy to replace the existing shuffle phase in MapReduce. We describe how to implement the shuffle service in three steps: extract the shuffle from the reduce task as a shuffle task, restructure the shuffle task as a service, and improve the I/O scheduling policy on the map side. Both simulated experiments and comparative studies of MapReduce jobs are conducted to evaluate the performance of our improvements. The results show that our approach can decrease the overall job execution time and make full use of cluster resources.
Keywords :
data analysis; data mining; input-output programs; learning (artificial intelligence); parallel programming; public domain software; software performance evaluation; Hadoop MapReduce shuffle improvement; I-O scheduling policy improvement; Map sides; data mining; large-scale data analysis; machine learning; parallel computing system; performance evaluation; reduce task; scientific simulation; shuffle extraction; shuffle service component; shuffle task-as-a-service; Bandwidth; Computational modeling; Data models; Facebook; Google; Memory management; Protocols; hadoop; mapreduce; shuffle;
Conference_Title :
Cloud Computing Technology and Science (CloudCom), 2013 IEEE 5th International Conference on
Conference_Location :
Bristol
DOI :
10.1109/CloudCom.2013.42