Author_Institution :
Fac. of Eng., Ulster Univ., Jordanstown, UK
Abstract :
Finding nearest neighbors is a general idea that underlies many artificial intelligence tasks, including machine learning, data mining, natural language understanding, and information retrieval. The idea is used explicitly in the k-nearest neighbors algorithm (kNN), a popular classification method. In this paper, we adopt it to develop a general methodology, neighborhood counting, for devising similarity functions. We turn our focus from neighbors to neighborhoods, where a neighborhood is a region in the data space covering the data point in question. To measure the similarity between two data points, we consider all neighborhoods that cover both data points and propose to use the number of such neighborhoods as the measure of similarity. Neighborhoods can be defined in different ways for different types of data. Here, we consider one definition of neighborhood for multivariate data and derive a formula for the resulting similarity, called the neighborhood counting measure (NCM). NCM was tested experimentally in the framework of kNN. Experiments show that NCM is generally comparable to VDM (value difference metric) and its variants, the state-of-the-art distance functions for multivariate data, and, at the same time, is consistently better for relatively large k values. Additionally, NCM consistently outperforms HEOM (a mixture of Euclidean and Hamming distances), the "standard" and most widely used distance function for multivariate data. NCM has a computational complexity of the same order as the standard Euclidean distance function. It is also task independent and works for numerical and categorical data in a conceptually uniform way. The experiments support the soundness of the neighborhood counting methodology for multivariate data; we hope it also works for other types of data.
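The counting idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the NCM formula or its definition of neighborhood, so everything specific here is an assumption made for illustration: a neighborhood is taken to be, per attribute, either an integer interval inside a known range [lo, hi] (numerical) or a nonempty subset of the m possible values (categorical), and the function names are hypothetical. Under those assumptions, the number of neighborhoods covering both points factorizes into a product of per-attribute counts.

```python
from itertools import combinations_with_replacement


def count_intervals(x, y, lo, hi):
    """Number of integer intervals [a, b] with lo <= a <= b <= hi
    that contain both x and y (assumed numerical neighborhood)."""
    return (min(x, y) - lo + 1) * (hi - max(x, y) + 1)


def count_intervals_brute(x, y, lo, hi):
    """Same count by explicit enumeration, used only as a sanity check."""
    pairs = combinations_with_replacement(range(lo, hi + 1), 2)  # yields a <= b
    return sum(1 for a, b in pairs if a <= min(x, y) and max(x, y) <= b)


def count_subsets(x, y, m):
    """Number of subsets of an m-value categorical domain that contain
    both x and y (assumed categorical neighborhood)."""
    return 2 ** (m - 1) if x == y else 2 ** (m - 2)


def neighborhood_count(x, y, specs):
    """Count neighborhoods covering both x and y.

    specs[i] is ("num", lo, hi) or ("cat", m); this encoding is a
    hypothetical convention for the sketch, not taken from the paper.
    """
    total = 1
    for xi, yi, spec in zip(x, y, specs):
        if spec[0] == "num":
            total *= count_intervals(xi, yi, spec[1], spec[2])
        else:
            total *= count_subsets(xi, yi, spec[1])
    return total


specs = [("num", 1, 5), ("cat", 3)]
same_color = neighborhood_count((2, "red"), (4, "red"), specs)
diff_color = neighborhood_count((2, "red"), (4, "blue"), specs)
print(same_color, diff_color)
```

In a kNN setting, a larger count would be read as higher similarity, so the k training points with the largest counts relative to the query would be treated as its nearest neighbors. The closed forms take one pass over the attributes, which is consistent with the abstract's remark that the cost is of the same order as Euclidean distance.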
Keywords :
artificial intelligence; pattern classification; Euclidean distances; Hamming distances; artificial intelligence tasks; classification method; computational complexity; data mining; information retrieval; k-nearest neighbors algorithm; machine learning; multivariate data; natural language understanding; neighborhood counting measure; similarity functions; similarity measure; state-of-the-art distance functions; Euclidean distance; Machine learning algorithms; Natural languages; Nearest neighbor searches; Testing; Pattern recognition; distance; nearest neighbors; similarity; Algorithms; Information Storage and Retrieval; Numerical Analysis, Computer-Assisted; Pattern Recognition, Automated