DocumentCode
3757148
Title
Improving the I/O Performance in the Reduce Phase of Hadoop
Author
Eita Fujishima;Saneyasu Yamaguchi
Author_Institution
Grad. Sch., Electr. Eng. &
fYear
2015
Firstpage
82
Lastpage
88
Abstract
Hadoop is a popular open-source MapReduce implementation. In jobs where the output files of all the relevant Map tasks are transmitted to and consolidated into a single Reduce task, such as TeraSort, that single Reduce task becomes the bottleneck and is I/O bound while processing many large output files. In most such cases, including TeraSort, the intermediate data, which include the output files of the Map tasks, are large and accessed sequentially. To improve the performance of these jobs, it is therefore important to increase sequential access performance. In this paper, we focus on the Hadoop sample job TeraSort, which is a single-Reduce-tasked job, and discuss a method for improving its performance. First, we run TeraSort and demonstrate that the single Reduce task is the bottleneck and is I/O bound. Second, we show the sequential I/O speed of each zone of an HDD. Third, we propose a method for improving the performance of such single-Reduce-tasked jobs. The proposed method controls the block bitmaps of the filesystem and stores the intermediate files in a faster zone, i.e., the outer range, of the HDD. Lastly, we present a performance evaluation with HDFS block sizes of 64 MB and 128 MB and demonstrate that our method improves performance.
Keywords
"Linux","Kernel","Throughput","Electrical engineering","Open source software","Performance evaluation","Data processing"
Publisher
ieee
Conference_Titel
Computing and Networking (CANDAR), 2015 Third International Symposium on
Electronic_ISBN
2379-1896
Type
conf
DOI
10.1109/CANDAR.2015.24
Filename
7424693