Author_Institution :
Dept. of Comput. Sci. & Technol., Xi'an Jiaotong Univ., Xi'an, China
Abstract :
MapReduce has emerged as a popular computing model used in data centers to process large datasets. In the map phase, hash partitioning is employed to distribute data sharing the same key across data center-scale cluster nodes. However, we observe that this approach can lead to uneven data distribution, which can result in skewed loads among reduce tasks and thus hamper the performance of MapReduce systems. Moreover, worker nodes in MapReduce systems may differ in computing capability due to (1) multiple generations of hardware in non-virtualized data centers, or (2) co-location of virtual machines in virtualized data centers. This heterogeneity among cluster nodes exacerbates the negative effects of uneven data distribution. To improve MapReduce performance in heterogeneous clusters, we propose a novel load balancing approach for the reduce phase. This approach consists of two components: (1) performance prediction for reducers that run on heterogeneous nodes, based on support vector machine models, and (2) heterogeneity-aware partitioning (HAP), which balances skewed data across reduce tasks. We implement this approach as a plug-in for a current MapReduce system. Experimental results demonstrate that our proposed approach distributes work evenly among reduce tasks and improves MapReduce performance with little overhead.
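The contrast the abstract draws between hash partitioning and capacity-aware assignment can be illustrated with a minimal sketch. This is not the authors' HAP algorithm, which the abstract does not specify; it swaps in a simple greedy heuristic that places the largest key groups first on the reducer with the smallest estimated finish time. The key counts and the relative reducer speeds are hypothetical values (in the paper's setting the speeds would come from the SVM-based performance prediction, which is not modeled here), and all function names are illustrative.

```python
import zlib


def hash_partition(key_counts, num_reducers):
    """Default-style partitioning: reducer index = hash(key) mod R."""
    loads = [0] * num_reducers
    for key, count in key_counts.items():
        # crc32 is a deterministic stand-in for the framework's hash function.
        loads[zlib.crc32(key.encode("utf-8")) % num_reducers] += count
    return loads


def capacity_aware_partition(key_counts, speeds):
    """Greedy heuristic (an assumption, not HAP itself): take key groups in
    descending size and give each to the reducer whose estimated finish
    time, (current load + group size) / speed, is smallest."""
    loads = [0] * len(speeds)
    assignment = {}
    for key, count in sorted(key_counts.items(), key=lambda kv: -kv[1]):
        best = min(range(len(speeds)),
                   key=lambda r: (loads[r] + count) / speeds[r])
        loads[best] += count
        assignment[key] = best
    return assignment, loads


def makespan(loads, speeds):
    """Estimated finish time of the slowest reducer."""
    return max(load / speed for load, speed in zip(loads, speeds))


# Hypothetical skewed key distribution (records per key).
key_counts = {"a": 500, "b": 300, "c": 100, "d": 50, "e": 30, "f": 20}
# Hypothetical relative processing speeds of three heterogeneous reducers.
speeds = [2.0, 1.0, 1.0]

hash_loads = hash_partition(key_counts, len(speeds))
_, aware_loads = capacity_aware_partition(key_counts, speeds)
print("hash partitioning:", hash_loads, makespan(hash_loads, speeds))
print("capacity-aware:   ", aware_loads, makespan(aware_loads, speeds))
```

Hash partitioning ignores both group sizes and node speeds, so where the large groups land is a matter of chance. The greedy variant uses both, which is the general idea behind balancing reduce loads on heterogeneous nodes.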
Keywords :
cloud computing; computer centres; file organisation; parallel programming; resource allocation; software performance evaluation; support vector machines; virtual machines; HAP; MapReduce performance improvement; computing model; data center-scale cluster nodes; hash partitioning; heterogeneity-aware partitioning; heterogeneous nodes; map phase; multiple hardware generations; nonvirtualized data centers; performance prediction; skewed load balancing; support vector machines models; task reduction; uneven data distribution; virtual machines colocation; virtualized data centers; Computational modeling; Data models; Load management; Load modeling; Virtual machining; MapReduce; skewed loads