DocumentCode :
623753
Title :
Improving ReduceTask data locality for sequential MapReduce jobs
Author :
Jian Tan ; Shicong Meng ; Xiaoqiao Meng ; Li Zhang
Author_Institution :
IBM T. J. Watson Res. Center, Yorktown Heights, NY, USA
fYear :
2013
fDate :
14-19 April 2013
Firstpage :
1627
Lastpage :
1635
Abstract :
Improving data locality for MapReduce jobs is critical for the performance of large-scale Hadoop clusters, embodying the principle of moving computation close to data on big data platforms. Scheduling tasks in the vicinity of stored data can significantly reduce network traffic, which is crucial for system stability and efficiency. Although the issue of data locality has been investigated extensively for MapTasks, most existing schedulers ignore data locality for ReduceTasks when fetching intermediate data, causing performance degradation. This problem of reducing the fetching cost for ReduceTasks has been identified recently. However, the proposed solutions are exclusively based on a greedy approach, relying on the intuition of placing ReduceTasks on the slots closest to the majority of the already generated intermediate data. The consequence is that, in the presence of job arrivals and departures, assigning the ReduceTasks of the current job to the nodes with the lowest fetching cost can prevent a subsequent job with an even better data locality match from being launched on the already occupied slots. To this end, we formulate a stochastic optimization framework to improve the data locality for ReduceTasks, with the optimal placement policy exhibiting a threshold-based structure. To ease implementation, we further propose a receding horizon control policy based on the optimal solution under restricted conditions. The improved performance is validated through simulation experiments and real performance tests on our testbed.
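As a rough illustration of the threshold-based structure described in the abstract (a sketch, not the authors' actual policy), the following Python fragment assigns an idle reduce slot to the currently waiting job only when its estimated remote-fetch cost falls below a threshold; otherwise the slot is held for a potentially better-matched future arrival. The cost model, threshold value, and all names are hypothetical.

    # Hypothetical sketch of a threshold-based ReduceTask placement rule.
    # The cost model and threshold are illustrative, not taken from the paper.
    from dataclasses import dataclass

    @dataclass
    class ReduceRequest:
        job_id: str
        local_fraction: float  # fraction of intermediate data already near the idle slot
        total_bytes: float     # total intermediate bytes the ReduceTask must fetch

    def fetch_cost(req: ReduceRequest) -> float:
        """Estimated bytes that would cross the network if the task runs on this slot."""
        return (1.0 - req.local_fraction) * req.total_bytes

    def assign_slot(waiting: ReduceRequest, threshold_bytes: float) -> bool:
        """Threshold rule: launch only if the remote-fetch cost is small enough;
        otherwise keep the slot free for a better-matched future job."""
        return fetch_cost(waiting) <= threshold_bytes

    if __name__ == "__main__":
        req = ReduceRequest(job_id="job_42", local_fraction=0.7, total_bytes=8e9)
        print(assign_slot(req, threshold_bytes=3e9))  # True: about 2.4 GB fetched remotely

A receding horizon variant, as the abstract suggests, would re-evaluate such a decision at each scheduling epoch over a finite look-ahead window and apply only the first placement decision.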
Keywords :
data handling; predictive control; scheduling; stochastic programming; MapTask; ReduceTask data locality improvement; fetching cost reduction; greedy approach; intermediate data fetching; job arrivals; job departures; large-scale Hadoop clusters; moving computation principle; network traffic; optimal placement policy; receding horizon control policy; sequential MapReduce jobs; stochastic optimization framework; task scheduling; threshold-based structure; Indexes; Network topology; Optimization; Random variables; System performance;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
INFOCOM, 2013 Proceedings IEEE
Conference_Location :
Turin
ISSN :
0743-166X
Print_ISBN :
978-1-4673-5944-3
Type :
conf
DOI :
10.1109/INFCOM.2013.6566959
Filename :
6566959