Title :
Imbalance and Concentration in k-NN Classification
Author :
Yin, Dawei ; An, Chang ; Baird, Henry S.
Author_Institution :
Dept. of Comput. Sci. & Eng., Lehigh Univ., Bethlehem, PA, USA
Abstract :
We propose algorithms for ameliorating difficulties in fast approximate k Nearest Neighbors (kNN) classifiers that arise from imbalances among classes in numbers of samples, and from concentrations of samples in small regions of feature space. These problems can occur with a wide range of binning kNN algorithms, such as K-d trees and our variant, hashed K-d trees. The principal method we discuss automatically rebalances training data and estimates concentration in each K-d hash bin separately, which then controls how many samples are kept in each bin. We report an experiment on 86.7M training samples that shows a sevenfold speedup and higher minimum per-class recall compared to previously reported methods. The context of these experiments is the need for image classifiers able to handle an unbounded variety of inputs: in our case, highly versatile document classifiers which require training sets as large as a billion samples.
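The abstract outlines per-bin pruning driven by class rebalancing and a concentration estimate but does not spell out the algorithm; the Python sketch below is one plausible reading, not the authors' implementation. The threshold-based bin hashing, the fixed `max_per_bin` cap, and the inverse-frequency class weighting are all illustrative assumptions.

```python
"""Minimal sketch of per-bin pruning for a binning (hashed K-d tree style)
kNN classifier. Hashing, the concentration criterion (bin occupancy vs. a
cap), and the rebalancing weights are assumptions, not the paper's method."""
import numpy as np
from collections import defaultdict


def hash_bin(x, thresholds):
    """Hash a feature vector to a bin key by thresholding each dimension
    (a crude stand-in for K-d hashing)."""
    return tuple(int(x[d] > t) for d, t in enumerate(thresholds))


def prune_bins(X, y, thresholds, max_per_bin=32, seed=0):
    """Group training samples into hash bins, then keep at most
    `max_per_bin` samples per bin, sampling classes in proportion to
    their inverse global frequency so rare classes are not crowded out.

    Returns the retained subset (X_kept, y_kept)."""
    rng = np.random.default_rng(seed)
    bins = defaultdict(list)
    for i, x in enumerate(X):
        bins[hash_bin(x, thresholds)].append(i)

    # Inverse-frequency weights: rarer classes get higher keep priority.
    classes, counts = np.unique(y, return_counts=True)
    inv_freq = {c: 1.0 / n for c, n in zip(classes, counts)}

    keep = []
    for idxs in bins.values():
        if len(idxs) <= max_per_bin:
            keep.extend(idxs)  # sparse bin: keep everything
            continue
        # Concentrated bin: subsample, biased toward under-represented classes.
        w = np.array([inv_freq[y[i]] for i in idxs])
        w /= w.sum()
        chosen = rng.choice(idxs, size=max_per_bin, replace=False, p=w)
        keep.extend(chosen.tolist())

    keep = np.array(sorted(keep))
    return X[keep], y[keep]


if __name__ == "__main__":
    # Toy imbalanced data: class 0 outnumbers class 1 ten to one.
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0.0, 1.0, size=(1000, 4)),
                   rng.normal(0.5, 1.0, size=(100, 4))])
    y = np.array([0] * 1000 + [1] * 100)
    thresholds = np.median(X, axis=0)  # one split threshold per dimension
    Xk, yk = prune_bins(X, y, thresholds, max_per_bin=16)
    print("kept", len(yk), "of", len(y), "samples; class counts:",
          np.bincount(yk))
```

In this reading, pruning concentrated bins bounds the work per query (fewer candidates to scan) while the inverse-frequency weighting preserves minority-class samples, which is consistent with the reported speedup and higher minimum per-class recall, though the paper's actual criteria may differ.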
Keywords :
cryptography; document image processing; feature extraction; image classification; trees (mathematics); binning kNN; document classifiers; feature space; image classifiers; K-d trees; k-NN classification; minimum per-class recall; training data; Accuracy; Approximation algorithms; Impurities; Nearest neighbor searches; Pixel; Runtime; Training; concentration; document content extraction; hashed K-d trees; imbalance; k Nearest Neighbors; kNN; versatile document analysis systems
Conference_Title :
2010 20th International Conference on Pattern Recognition (ICPR)
Conference_Location :
Istanbul
Print_ISBN :
978-1-4244-7542-1
DOI :
10.1109/ICPR.2010.531