Author :
Zhang, Yuezhuo ; Zha, Li ; Liu, Jia ; Xu, Zhiwei
Abstract :
In this paper, we propose an online search system based on Key-Value Store which aims to provide real-time k-NN (k-Nearest Neighbor) search in large-scale high-dimensional vector spaces. Through an improved indexing method based on KD-tree, the vector space is divided into a number of fixed-size heaps, only vectors of a specified heap need to do k-NN calculation, the time complexity is O(1). So, the key step becomes to search a heap in the Key-Value Store by heap ID, we approximate the global computing problem of Euclidean distance k-NN search into a look-up table problem. Therefore, on the one hand, the search cost of a single vector mainly depends on the progress to find the value of a given key in Key-Value Store, whose time complexity is O (log2N). On the other hand, because of the linear horizontal scaling properties of the Key-Value Store, we can increase the number of hosts to linearly increase the throughput of the entire cluster. So, maintaining a fixed proportion of the cluster size and the target vector space can obtain relatively fixed search performance, and complete the search process in a relatively fixed period of time. As a result", "even if the vector space is very large, we can also achieve an effective online search by enough number of hosts. Firstly, this paper introduces the main processes of the system, then the indexing algorithm and vector data storage format is described. Finally, we take the Euclidean distance of the 128-dimensional SIFT feature vectors k-NN search as load to give the performance assessment.
Keywords :
Internet; computational complexity; content-based retrieval; indexing; pattern clustering; performance evaluation; real-time systems; search engines; storage management; 128-dimensional SIFT feature vectors; Euclidean distance k-NN search; KD-tree-based indexing method; cluster size; fixed-size heaps; global computing problem approximation; indexing algorithm; key-value store; large-scale high-dimensional vector spaces; large-scale online search system; linear horizontal scaling properties; look-up table problem; performance assessment; real-time k-NN search; real-time k-nearest neighbor search; search performance; time complexity; vector data storage format; Euclidean distance; Indexing; Libraries; Search problems; Vectors; Key-Value Store; high-dimensional; k-NN; large-scale; online search system;