A fast exact k-nearest neighbors algorithm for high dimensional search using k-means clustering and triangle inequality

Author

Wang, Xueyi

Author_Institution

Dept. of Math. & Comput. Sci., Northwest Nazarene Univ., Nampa, ID, USA

fYear

2011

fDate

July 31 2011-Aug. 5 2011

Firstpage

1293

Lastpage

1299

Abstract

The k-nearest neighbors (k-NN) algorithm is a widely used machine learning method that finds nearest neighbors of a test object in a feature space. We present a new exact k-NN algorithm called kMkNN (k-Means for k-Nearest Neighbors) that uses the k-means clustering and the triangle inequality to accelerate the searching for nearest neighbors in a high dimensional space. The kMkNN algorithm has two stages. In the buildup stage, instead of using complex tree structures such as metric trees, kd-trees, or ball-tree, kMkNN uses a simple k-means clustering method to preprocess the training dataset. In the searching stage, given a query object, kMkNN finds nearest training objects starting from the nearest cluster to the query object and uses the triangle inequality to reduce the distance calculations. Experiments show that the performance of kMkNN is surprisingly good compared to the traditional k-NN algorithm and tree-based k-NN algorithms such as kd-trees and ball-trees. On a collection of 20 datasets with up to 10⁶ records and 10⁴ dimensions, kMkNN shows a 2- to 80-fold reduction of distance calculations and a 2- to 60-fold speedup over the traditional k-NN algorithm for 16 datasets. Furthermore, kMkNN performs significant better than a kd-tree based k-NN algorithm for all datasets and performs better than a ball-tree based k-NN algorithm for most datasets. The results show that kMkNN is effective for searching nearest neighbors in high dimensional spaces.

Keywords

learning (artificial intelligence); pattern clustering; trees (mathematics); ball-tree; complex tree structures; exact k-NN algorithm; exact k-nearest neighbors algorithm; feature space; high dimensional search; k-means clustering method; k-means for k-nearest neighbors; kd-trees; machine learning method; metric trees; tree-based k-NN algorithms; triangle inequality; Acceleration; Algorithm design and analysis; Classification algorithms; Clustering algorithms; Machine learning algorithms; Search problems; Training;

fLanguage

English

Publisher

ieee

Conference_Titel

Neural Networks (IJCNN), The 2011 International Joint Conference on

Conference_Location

San Jose, CA

ISSN

2161-4393

Print_ISBN

978-1-4244-9635-8

Type

conf

DOI

10.1109/IJCNN.2011.6033373

Filename

6033373