Abstract:
Hadoop is a popular open-source MapReduce implementation. In jobs where the output files of all Map tasks are transferred to and merged by a single Reduce task, as in TeraSort, that Reduce task becomes the bottleneck and is I/O-bound, since it must process many large output files. In most such cases, including TeraSort, the intermediate data, i.e., the output files of the Map tasks, are large and accessed sequentially, so increasing sequential access performance is key to improving these jobs. In this paper, we focus on the Hadoop sample job TeraSort, a single-Reduce-task job, and discuss a method for improving its performance. First, we run TeraSort and demonstrate that the single Reduce task is the bottleneck and is I/O-bound. Second, we measure the sequential I/O speed of each zone of an HDD. Third, we propose a method for improving the performance of such single-Reduce-task jobs: it manipulates the block bitmaps of the filesystem so that the intermediate files are stored in a faster zone, i.e., the outer range, of the HDD. Lastly, we present a performance evaluation with HDFS block sizes of 64 MB and 128 MB and demonstrate that our method improves performance.
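As a concrete illustration of the zone effect the abstract refers to, the sketch below (a minimal, hypothetical measurement script, not the tool used in the paper; the device path /dev/sdX and the sample counts are placeholders) reads a 64 MB chunk, matching one HDFS block, at evenly spaced offsets across a raw HDD and reports sequential read throughput per offset. Because HDDs use zoned bit recording and low logical block addresses usually map to the outer tracks, throughput typically declines from the first sample to the last; on many drives the innermost zone reads at roughly half the speed of the outermost, which is the gap a block-bitmap placement scheme can exploit.

```python
import os
import time

# Hypothetical raw block device; adjust before running (reading it
# requires permission, typically root, and is Linux-specific).
DEVICE = "/dev/sdX"
CHUNK = 64 * 1024 * 1024   # one 64 MB sample per offset, matching an HDFS block
PIECE = 1024 * 1024        # read in 1 MB pieces
SAMPLES = 10               # offsets spaced from the outer to the inner edge

def sequential_mb_per_s(fd, offset):
    """Read CHUNK bytes sequentially starting at offset; return MB/s."""
    os.lseek(fd, offset, os.SEEK_SET)
    done = 0
    start = time.monotonic()
    while done < CHUNK:
        buf = os.read(fd, PIECE)
        if not buf:          # reached the end of the device
            break
        done += len(buf)
    elapsed = time.monotonic() - start
    return done / elapsed / 1e6

fd = os.open(DEVICE, os.O_RDONLY)
try:
    size = os.lseek(fd, 0, os.SEEK_END)  # device size in bytes
    for i in range(SAMPLES):
        offset = (size - CHUNK) * i // (SAMPLES - 1)
        # Ask the kernel to drop cached pages so every sample hits the disk.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        print(f"offset {offset // 10**9:4d} GB: "
              f"{sequential_mb_per_s(fd, offset):7.1f} MB/s")
finally:
    os.close(fd)
```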
Keywords:
"Linux","Kernel","Throughput","Electrical engineering","Open source software","Performance evaluation","Data processing"