• DocumentCode
    3757148
  • Title
    Improving the I/O Performance in the Reduce Phase of Hadoop
  • Author
    Eita Fujishima; Saneyasu Yamaguchi
  • Author_Institution
    Grad. Sch., Electr. Eng. &
  • fYear
    2015
  • Firstpage
    82
  • Lastpage
    88
  • Abstract
    Hadoop is a popular open-source MapReduce implementation. In jobs where the output files of all relevant Map tasks are transmitted to and consolidated by a single Reduce task, as in TeraSort, that single Reduce task becomes the bottleneck and is I/O bound because it processes many large output files. In most such cases, including TeraSort, the intermediate data, i.e., the output files of the Map tasks, are large and accessed sequentially, so increasing sequential access performance is important for improving these jobs. In this paper, we focus on the Hadoop sample job TeraSort, a single-Reduce-task job, and discuss a method for improving its performance. First, we run TeraSort and demonstrate that the single Reduce task is the bottleneck and is I/O bound. Second, we measure the sequential I/O speed of each zone of an HDD. Third, we propose a method for improving the performance of such single-Reduce-task jobs: it controls the block bitmaps of the filesystem so that the intermediate files are stored in a faster zone, i.e., the outer range, of the HDD. Lastly, we present a performance evaluation with HDFS block sizes of 64 MB and 128 MB and demonstrate that our method improves performance.
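    The zone-wise sequential I/O measurement mentioned in the abstract (reading at different offsets of a disk, since outer zones of an HDD are faster) could be sketched roughly as follows. This is a minimal sketch, not the authors' tool: the function name and parameters are hypothetical, and on a real block device one would additionally need O_DIRECT (with aligned buffers) to bypass the page cache.

```python
import os
import time

def sequential_read_mbps(path, offset, length, block=1 << 20):
    """Measure sequential read throughput (MB/s) over one byte range.

    Hypothetical helper: `path` would be a raw device (e.g. /dev/sdb)
    when probing HDD zones; different `offset` values sample different
    zones. Reads `length` bytes in `block`-sized chunks and times them.
    Caveat: without O_DIRECT, the page cache can inflate the result.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        os.lseek(fd, offset, os.SEEK_SET)
        start = time.perf_counter()
        remaining = length
        while remaining > 0:
            data = os.read(fd, min(block, remaining))
            if not data:  # reached end of file/device
                break
            remaining -= len(data)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    read_bytes = length - remaining
    return (read_bytes / (1 << 20)) / elapsed if elapsed > 0 else 0.0
```

    Sampling this function at evenly spaced offsets across the device would yield the per-zone throughput profile that motivates placing intermediate files in the outer (faster) range.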
  • Keywords
    "Linux","Kernel","Throughput","Electrical engineering","Open source software","Performance evaluation","Data processing"
  • Publisher
    ieee
  • Conference_Titel
    Computing and Networking (CANDAR), 2015 Third International Symposium on
  • Electronic_ISBN
    2379-1896
  • Type
    conf
  • DOI
    10.1109/CANDAR.2015.24
  • Filename
    7424693