DocumentCode
2999384
Title
Analysis and Optimization of Data Import with Hadoop
Author
Xu, Weijia ; Luo, Wei ; Woodward, Nicholas
Author_Institution
Texas Adv. Comput. Center, Univ. of Texas at Austin, Austin, TX, USA
fYear
2012
fDate
21-25 May 2012
Firstpage
1058
Lastpage
1066
Abstract
Data-driven research has become an important part of scientific discovery in an increasing number of disciplines. In many cases, the sheer volume of data to be processed requires not only state-of-the-art computing resources but also carefully tuned, purpose-built software. These requirements are often associated with huge operational costs and significant expertise in software development. Due to its simplicity for the user and its effectiveness at processing big data, Hadoop has become a popular software platform for large-scale data analysis. Using a Hadoop cluster in a remote shared infrastructure enables users to avoid the costs of maintaining a physical infrastructure. An inevitable step in using a dynamically constructed Hadoop cluster is the initial import of the data. This process is not trivial, particularly when the data set is large. In this paper, we evaluate the costs of importing large-scale data into a Hadoop cluster. We present a detailed analysis of the default data import implementation in Hadoop and conduct a practical evaluation. Our evaluation includes tests with different hardware configurations, such as different network protocols and disk configurations. We also propose an implementation that improves the performance of importing data into a Hadoop cluster, in which the data is accessed directly by the data nodes during the import process.
Keywords
data analysis; optimisation; software engineering; Hadoop cluster; data driven research; data import; large-scale data analysis; optimization; scientific discovery; software development; software platform; state-of-the-art computing resources; Computational modeling; Data models; File systems; Hardware; Pipelines; Sockets; Throughput; Hadoop; cloud computing; data import; data transfer; disk I/O
fLanguage
English
Publisher
IEEE
Conference_Titel
Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2012 IEEE 26th International
Conference_Location
Shanghai
Print_ISBN
978-1-4673-0974-5
Type
conf
DOI
10.1109/IPDPSW.2012.129
Filename
6270755