DocumentCode :
653973
Title :
Data Pipeline in MapReduce
Author :
Jiaan Zeng ; Beth Plale
Author_Institution :
Sch. of Inf. & Comput., Indiana Univ. Bloomington, Bloomington, IN, USA
fYear :
2013
fDate :
22-25 Oct. 2013
Firstpage :
164
Lastpage :
171
Abstract :
MapReduce is an effective programming model for large-scale text and data analysis. Traditional MapReduce implementations, e.g., Hadoop, have the restriction that the entire input data set must be loaded into the cluster before any analysis can take place. This can introduce sizable latency when the data set is large and when it is not possible to load the data once and process it many times, a situation that arises, for instance, with log files, health records, and protected texts. We propose a data pipeline approach to hide data upload latency in MapReduce analysis. Our implementation, which is based on Hadoop MapReduce, is completely transparent to the user. It introduces a distributed concurrency queue to coordinate data block allocation and synchronization so as to overlap data upload and execution. The paper addresses two scheduling challenges: scheduling a fixed number of maps and scheduling a dynamic number of maps, the latter allowing better handling of input data sets of unknown size. We also employ a delay scheduler to achieve data locality for the data pipeline. An evaluation of the solution on different applications with real-world data sets shows that our approach yields performance gains.
Keywords :
data analysis; distributed processing; pipeline processing; Hadoop MapReduce; MapReduce analysis; MapReduce implementation; data block allocation; data pipeline; data set; health records; log files; programming model; protected texts; Concurrent computing; Delays; Distributed databases; Dynamic scheduling; Equations; Pipelines; Schedules; MapReduce
fLanguage :
English
Publisher :
IEEE
Conference_Title :
eScience (eScience), 2013 IEEE 9th International Conference on
Conference_Location :
Beijing
Type :
conf
DOI :
10.1109/eScience.2013.21
Filename :
6683904