Title :
List of twin clusters: a data structure for similarity joins in metric spaces
Author :
Paredes, Rodrigo ; Reyes, Nora
Author_Institution :
Depto. de Cienc. de la Comput., Univ. de Chile, Santiago
Abstract :
The metric space model abstracts many proximity or similarity problems, where the most frequently considered primitives are range and k-nearest neighbor search, leaving out the similarity join, an extremely important primitive. In fact, despite the great attention that this primitive has received in traditional and even multidimensional databases, little has been done for general metric databases. We consider a particular type of similarity join: Given two sets of objects and a distance threshold r, find all the object pairs (one from each set) at distance at most r. For this sake, we devise a new metric index, coined List of Twin Clusters, which indexes both sets jointly (instead of the natural approach of indexing one or both sets independently). Our results show significant speedups over the basic quadratic-time naive alternative. Furthermore, we show that our technique can be easily extended to other similarity join variants, e.g., finding the k-closest pairs.
Keywords :
data structures; database indexing; pattern clustering; query processing; data structure; dataset indexing; k-nearest neighbor search; metric index; metric space model; range query; similarity join algorithms; twin clusters; Cleaning; Data mining; Data structures; Databases; Extraterrestrial measurements; Information retrieval; Performance evaluation; Recruitment; Resumes; Search problems;
Conference_Titel :
Data Engineering Workshop, 2008. ICDEW 2008. IEEE 24th International Conference on
Conference_Location :
Cancun
Print_ISBN :
978-1-4244-2161-9
Electronic_ISBN :
978-1-4244-2162-6
DOI :
10.1109/ICDEW.2008.4498353