Title :
IKMC: An Improved K-Medoids Clustering Method for Near-Duplicated Records Detection
Author :
Pei, Ying ; Xu, Jungang ; Cen, Zhiwang ; Sun, Jian
Author_Institution :
Sch. of Inf. Sci. & Eng., Grad. Univ. of Chinese Acad. of Sci., Beijing, China
Abstract :
An improved K-medoids clustering algorithm (IKMC) for detecting near-duplicated records is proposed in this paper. It treats every record in the database as a separate data object, uses the edit-distance method and attribute weights to compute similarity values among records, and then detects duplicated records by clustering on these similarity values. The algorithm automatically adjusts the number of clusters by comparing each similarity value with a preset similarity threshold, and it avoids the large number of I/O operations that the traditional "sort/merge" algorithm spends on sorting. Experiments show that the algorithm has good detection accuracy and high availability.
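The idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not give the similarity formula or the medoid-update rule, so the normalization in `field_similarity`, the weighted average in `record_similarity`, and the single-pass grouping in `threshold_cluster` (which keeps the first record of each cluster as a fixed representative) are all assumptions made for illustration, as are the function names and sample records.

```python
from typing import List, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def field_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def record_similarity(r1: Sequence[str], r2: Sequence[str],
                      weights: Sequence[float]) -> float:
    """Assumed combination: weighted average of per-attribute similarities."""
    total = sum(weights)
    return sum(w * field_similarity(x, y)
               for w, x, y in zip(weights, r1, r2)) / total


def threshold_cluster(records: Sequence[Sequence[str]],
                      weights: Sequence[float],
                      threshold: float) -> List[List[int]]:
    """Simplified single pass: a record joins the most similar existing
    cluster if its similarity to that cluster's representative reaches the
    threshold; otherwise it starts a new cluster, so the number of clusters
    is not fixed in advance. Representatives are never updated here."""
    clusters: List[List[int]] = []  # first index in each list is the representative
    for idx, rec in enumerate(records):
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = record_similarity(rec, records[cluster[0]], weights)
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append([idx])
        else:
            best.append(idx)
    return clusters


# Hypothetical sample data: (name, city) with the name weighted more heavily.
records = [("John Smith", "Beijing"),
           ("Jon Smith", "Beijing"),
           ("Alice Wong", "Wuhan")]
groups = threshold_cluster(records, weights=[0.7, 0.3], threshold=0.8)
```

With these sample records the first two rows fall into one cluster and the third forms its own, so any cluster holding more than one index marks a candidate group of near-duplicated records.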
Keywords :
database management systems; merging; pattern clustering; records management; sorting; database; edit-distance method; improved K-medoids clustering method; merge algorithm; near-duplicated records detection; sort algorithm; Availability; Clustering algorithms; Clustering methods; Database systems; Information science; Information systems; Object detection; Sorting; Space technology; Sun
Conference_Title :
Computational Intelligence and Software Engineering, 2009. CiSE 2009. International Conference on
Conference_Location :
Wuhan
Print_ISBN :
978-1-4244-4507-3
Electronic_ISBN :
978-1-4244-4507-3
DOI :
10.1109/CISE.2009.5364382