Title :
An incremental clustering scheme for duplicate detection in large databases
Author :
Cesario, Eugenio ; Folino, Francesco ; Manco, Giuseppe ; Pontieri, Luigi
Author_Institution :
ICAR-CNR, Rende, Italy
Abstract :
We propose an incremental algorithm for clustering duplicate tuples in large databases. It assigns any new tuple t to the cluster containing the database tuples that are most similar to t (and hence likely to refer to the same real-world entity as t). The core of the approach is a hash-based indexing technique that tends to assign highly similar objects to the same buckets. Empirical evaluation shows that the proposed method yields a considerable efficiency improvement over a state-of-the-art index structure for proximity searches in metric spaces.
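The general idea of hashing similar tuples into shared buckets and assigning each new tuple incrementally can be illustrated with a minimal sketch. The abstract does not specify the hash family, the similarity measure, or any parameters, so everything below is an assumption made for illustration: MinHash signatures over character 3-grams, banding into bucket keys, Jaccard similarity, and a threshold of 0.5. The names `IncrementalDeduper`, `qgrams`, and `jaccard` are hypothetical and are not taken from the paper.

```python
import hashlib
from collections import defaultdict


def qgrams(text, q=3):
    """Return the set of character q-grams of a whitespace/case-normalized string."""
    s = " ".join(text.lower().split())
    if len(s) <= q:
        return {s}
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed measure, not from the paper)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


class IncrementalDeduper:
    """Illustrative sketch only: MinHash banding stands in for the paper's index."""

    def __init__(self, num_bands=8, rows_per_band=2, threshold=0.5):
        self.num_bands = num_bands
        self.rows_per_band = rows_per_band
        self.threshold = threshold
        self.buckets = defaultdict(list)  # (band, key) -> tuple ids
        self.grams = []                   # tuple id -> q-gram set
        self.cluster_of = []              # tuple id -> cluster id
        self.next_cluster = 0

    def _bucket_keys(self, grams):
        # One MinHash value per seed, grouped into bands of rows_per_band values.
        n = self.num_bands * self.rows_per_band
        sig = [min(_hash(seed, g) for g in grams) for seed in range(n)]
        r = self.rows_per_band
        return [(b, tuple(sig[b * r:(b + 1) * r])) for b in range(self.num_bands)]

    def add(self, text):
        """Insert a tuple (given as a string) and return its cluster id."""
        grams = qgrams(text)
        keys = self._bucket_keys(grams)
        # Only tuples sharing at least one bucket are compared with the new one.
        candidates = {tid for k in keys for tid in self.buckets.get(k, ())}
        best_id, best_sim = None, 0.0
        for tid in candidates:
            sim = jaccard(grams, self.grams[tid])
            if sim > best_sim:
                best_id, best_sim = tid, sim
        if best_id is not None and best_sim >= self.threshold:
            cluster = self.cluster_of[best_id]   # join the most similar tuple's cluster
        else:
            cluster = self.next_cluster          # no close match: open a new cluster
            self.next_cluster += 1
        new_id = len(self.grams)
        self.grams.append(grams)
        self.cluster_of.append(cluster)
        for k in keys:
            self.buckets[k].append(new_id)
        return cluster
```

In this sketch each call to `add` compares the new tuple only against tuples that share a bucket with it, which is what avoids a full scan of the database. More bands or fewer rows per band make near-duplicates more likely to collide, at the cost of more candidate comparisons.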
Keywords :
database indexing; database tuples; duplicate detection; duplicate tuples; hash-based indexing; incremental clustering; index structure; large databases; metric spaces; proximity searches; Clustering algorithms; Clustering methods; Couplings; Data engineering; Delay; Extraterrestrial measurements; Indexing; Information retrieval; Scalability; Spatial databases;
Conference_Title :
Database Engineering and Application Symposium, 2005. IDEAS 2005. 9th International
Print_ISBN :
0-7695-2404-4
DOI :
10.1109/IDEAS.2005.10