Title :
Scalable load balancing for MapReduce-based record linkage
Author :
Wei Yan; Yuan Xue; Bradley Malin
Author_Institution :
Dept. of Electr. Eng. & Comput. Sci., Vanderbilt Univ., Nashville, TN, USA
Abstract :
Recent research has introduced load balancing schemes that are aware of the input data distribution (i.e., the data profile) to mitigate data skew and fully exploit the parallel capability of the MapReduce framework in support of record linkage. However, existing solutions face a significant scalability issue when applied to massive data sets with millions or billions of blocks (a basic unit in record linkage), because their data profiles cannot be maintained both precisely and efficiently. The goal of this paper is to introduce a profiling method based on the notion of a sketch, which allows for a compact, scalable solution for maintaining block size statistics. In addition, we propose two load balancing algorithms that operate over sketch-based profiles while solving the data skew problem associated with record linkage. We provide a theoretical analysis and extensive experiments (using Hadoop), with real and controlled synthetic data sets, to illustrate the effectiveness of our solution. The experimental results show that our load balancing algorithms can decrease the overall job completion time by 71.56% and 70.73% relative to the default settings in Hadoop on a set of DBLP data sets containing 2.5 to 50.4 million records.
Keywords :
data handling; resource allocation; statistics; MapReduce-based record linkage; analytical analysis; block size statistics; data skew; input data distribution; scalability issue; scalable load balancing; Algorithm design and analysis; Arrays; Couplings; Indexes; Load management; Radiation detectors; Vectors; Load Balance; MapReduce; Record Linkage; Scalability
Conference_Title :
Performance Computing and Communications Conference (IPCCC), 2013 IEEE 32nd International
Conference_Location :
San Diego, CA
Print_ISBN :
978-1-4799-3213-9
DOI :
10.1109/PCCC.2013.6742785