A Grid-Based k-Nearest Neighbor Join for Large Scale Datasets on MapReduce

Author

Miyoung Jang;Young-Sung Shin;Jae-Woo Chang

Author_Institution

Dept. of Comput. Eng., Chonbuk Nat. Univ., Jeonju, South Korea

fYear

2015

Firstpage

888

Lastpage

891

Abstract

Because MapReduce supports efficient parallel data processing, MapReduce-based query processing algorithms have been widely studied. Among various query types, k-nearest neighbor join, which aims to produce the k nearest neighbors of each point of a dataset from another dataset, has been considered most important in data analysis. Existing k-NN join query processing algorithms on MapReduce suffer from high index construction and computation costs which make them unsuitable for big data processing. In this paper, we propose a new grid-based k-NN join query processing algorithm on MapReduce. First, we design a dynamic grid index that represents the distribution of join datasets. Based on this index, we prune out unnecessary cells for the join with the distance-based filtering. This can reduce the data transmission and computation overheads. From performance analysis, we show that our algorithm outperforms the existing scheme up to seven times in terms of query processing time while achieving high query result accuracy.

Keywords

"Partitioning algorithms","Algorithm design and analysis","Query processing","Indexes","Data processing","Heuristic algorithms","Accuracy"

Publisher

ieee

Conference_Titel

High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conferen on Embedded Software and Systems (ICESS), 2015 IEEE 17th International Conference on

Type

conf

DOI

10.1109/HPCC-CSS-ICESS.2015.189

Filename

7336277