DocumentCode
3757148
Title
Improving the I/O Performance in the Reduce Phase of Hadoop
Author
Eita Fujishima;Saneyasu Yamaguchi
Author_Institution
Grad. Sch., Electr. Eng. &
fYear
2015
Firstpage
82
Lastpage
88
Abstract
Hadoop is a popular open-source MapReduce implementation. In jobs where the output files of all the relevant Map tasks are transmitted to and consolidated into a single Reduce task, such as TeraSort, that single Reduce task becomes the bottleneck and is I/O bound while processing many large output files. In most such cases, including TeraSort, the intermediate data, which include the output files of the Map tasks, are large and accessed sequentially. To improve the performance of these jobs, it is therefore important to increase sequential access performance. In this paper, we focus on the Hadoop sample job TeraSort, which is a single-Reduce-tasked job, and discuss a method for improving its performance. First, we run TeraSort and demonstrate that the single Reduce task is the bottleneck and is I/O bound. Second, we show the sequential I/O speed of each zone of an HDD. Third, we propose a method for improving the performance of such single-Reduce-tasked jobs. The proposed method controls the block bitmaps of the filesystem and stores the intermediate files in a faster zone, i.e., the outer range, of the HDD. Lastly, we present a performance evaluation with HDFS block sizes of 64 MB and 128 MB and demonstrate that our method improves performance.
Keywords
"Linux","Kernel","Throughput","Electrical engineering","Open source software","Performance evaluation","Data processing"
Publisher
ieee
Conference_Titel
Computing and Networking (CANDAR), 2015 Third International Symposium on
Electronic_ISBN
2379-1896
Type
conf
DOI
10.1109/CANDAR.2015.24
Filename
7424693