DocumentCode :
2533842
Title :
An incremental clustering scheme for duplicate detection in large databases
Author :
Cesario, Eugenio ; Folino, Francesco ; Manco, Giuseppe ; Pontieri, Luigi
Author_Institution :
ICAR-CNR, Rende, Italy
fYear :
2005
fDate :
25-27 July 2005
Firstpage :
89
Lastpage :
95
Abstract :
We propose an incremental algorithm for clustering duplicate tuples in large databases. It assigns each new tuple t to the cluster containing the database tuples that are most similar to t (and hence are likely to refer to the same real-world entity as t). The core of the approach is a hash-based indexing technique that tends to assign highly similar objects to the same buckets. Empirical evaluation shows that the proposed method yields a considerable efficiency improvement over a state-of-the-art index structure for proximity searches in metric spaces.
Keywords :
database indexing; database tuples; duplicate detection; duplicate tuples; hash-based indexing; incremental clustering; index structure; large databases; metric spaces; proximity searches; Clustering algorithms; Clustering methods; Couplings; Data engineering; Delay; Extraterrestrial measurements; Indexing; Information retrieval; Scalability; Spatial databases;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Database Engineering and Application Symposium, 2005. IDEAS 2005. 9th International
ISSN :
1098-8068
Print_ISBN :
0-7695-2404-4
Type :
conf
DOI :
10.1109/IDEAS.2005.10
Filename :
1540899