• DocumentCode
    2999384
  • Title

    Analysis and Optimization of Data Import with Hadoop

  • Author

    Xu, Weijia ; Luo, Wei ; Woodward, Nicholas

  • Author_Institution
    Texas Adv. Comput. Center, Univ. of Texas at Austin, Austin, TX, USA
  • fYear
    2012
  • fDate
    21-25 May 2012
  • Firstpage
    1058
  • Lastpage
    1066
  • Abstract
    Data-driven research has become an important part of scientific discovery in an increasing number of disciplines. In many cases, the sheer volume of data to be processed requires not only state-of-the-art computing resources but also carefully tuned and specifically developed software. These requirements are often associated with huge operational costs and significant expertise in software development. Due to its simplicity for the user and effectiveness at processing big data, Hadoop has become a popular software platform for large-scale data analysis. Using a Hadoop cluster in a remote shared infrastructure enables users to avoid the costs of maintaining a physical infrastructure. An inevitable step in using a dynamically constructed Hadoop cluster is the initial import of the data. This process is not trivial, particularly when the data is large. In this paper, we evaluate the costs of importing large-scale data into a Hadoop cluster. We present a detailed analysis of the default data import implementation in Hadoop and conduct a practical evaluation. Our evaluation includes tests with different hardware configurations, such as different network protocols and disk configurations. We also propose an implementation that improves the performance of importing data into a Hadoop cluster, wherein the data is accessed directly by Data nodes during the import process.
  • Keywords
    data analysis; optimisation; software engineering; Hadoop cluster; data driven research; data import; large-scale data analysis; optimization; scientific discovery; software development; software platform; state-of-the-art computing resources; Computational modeling; Data models; File systems; Hardware; Pipelines; Sockets; Throughput; Hadoop; cloud computing; data import; data transfer; disk I/O;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2012 IEEE 26th International
  • Conference_Location
    Shanghai
  • Print_ISBN
    978-1-4673-0974-5
  • Type

    conf

  • DOI
    10.1109/IPDPSW.2012.129
  • Filename
    6270755