Abstract :
The metric space model abstracts many proximity or similarity problems, where the most frequently considered primitives are range and k-nearest neighbor search, leaving out the similarity join, an extremely important primitive. In fact, despite the great attention that this primitive has received in traditional and even multidimensional databases, little has been done for general metric databases. We consider a particular type of similarity join: Given two sets of objects and a distance threshold r, find all the object pairs (one from each set) at distance at most r. For this sake, we devise a new metric index, coined List of Twin Clusters, which indexes both sets jointly (instead of the natural approach of indexing one or both sets independently). Our results show significant speedups over the basic quadratic-time naive alternative. Furthermore, we show that our technique can be easily extended to other similarity join variants, e.g., finding the k-closest pairs.
Keywords :
data structures; database management systems; query processing; data structure; general metric databases; k-nearest neighbor search; metric index; metric spaces; multidimensional databases; similarity joins; twin clusters; Cleaning; Data mining; Data structures; Databases; Extraterrestrial measurements; Information retrieval; Performance evaluation; Recruitment; Resumes; Search problems; Similarity joins;