Analysis and Optimization of Data Import with Hadoop

Author

Xu, Weijia ; Luo, Wei ; Woodward, Nicholas

Author_Institution

Texas Adv. Comput. Center, Univ. of Texas at Austin, Austin, TX, USA

fYear

2012

fDate

21-25 May 2012

Firstpage

1058

Lastpage

1066

Abstract

Data-driven research has become an important part of scientific discovery in an increasing number of disciplines. In many cases, the sheer volume of data to be processed requires not only state-of-the-art computing resources but also carefully tuned and specifically developed software. These requirements are often associated with huge operational costs and significant expertise in software development. Due to its simplicity for the user and effectiveness at processing big data, Hadoop has become a popular software platform for large-scale data analysis. Using a Hadoop cluster in a remote shared infrastructure enables users to avoid the costs of maintaining a physical infrastructure. An inevitable step in using a dynamically constructed Hadoop cluster is the initial import of the data. This process is not trivial, particularly when the data is large. In this paper, we evaluate the costs of importing large-scale data into a Hadoop cluster. We present a detailed analysis of the default data import implementation in Hadoop and conduct a practical evaluation. Our evaluation includes tests with different hardware configurations, such as different network protocols and disk configurations. We also propose an implementation that improves the performance of importing data into a Hadoop cluster, wherein the data is accessed directly by Data nodes during the import process.

Keywords

data analysis; optimisation; software engineering; Hadoop cluster; data driven research; data import; large-scale data analysis; optimization; scientific discovery; software development; software platform; state-of-the-art computing resources; Computational modeling; Data models; File systems; Hardware; Pipelines; Sockets; Throughput; Hadoop; cloud computing; data import; data transfer; disk I/O;

fLanguage

English

Publisher

IEEE

Conference_Titel

Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2012 IEEE 26th International

Conference_Location

Shanghai

Print_ISBN

978-1-4673-0974-5

Type

conf

DOI

10.1109/IPDPSW.2012.129

Filename

6270755