DocumentCode
2319204
Title
A threshold-based similarity measure for duplicate detection
Author
Ektefa, Mohammadreza ; Sidi, Fatimah ; Ibrahim, Hamidah ; Jabar, Marzanah A. ; Memar, Sara ; Ramli, Abdullah
Author_Institution
Dept. of CS, UPM, Serdang, Malaysia
fYear
2011
fDate
25-28 Sept. 2011
Firstpage
37
Lastpage
41
Abstract
Data integration is essential for extracting useful information and recognizing patterns in large volumes of data stored in different databases with different formats. However, data integration may lead to duplication: because the data are available in different formats, several records may refer to the same entity. Duplicate detection, or record linkage, is a technique for detecting and matching the duplicate records generated during the data integration process. Most approaches concentrate on string similarity measures for comparing records, but they fail to identify records that share the same semantic information. This study therefore proposes a threshold-based method that takes into account both string and semantic similarity measures when comparing record pairs. The method is evaluated on a real-world dataset, Restaurant, and its effectiveness is measured using several standard evaluation metrics. The experimental results indicate that the proposed similarity method, which combines string and semantic similarity measures, outperforms the individual similarity measures, achieving an F-measure of 99.1% on the Restaurant dataset. Semantic similarity should therefore be considered alongside string similarity in order to detect duplicate records more effectively.
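The abstract does not say which string or semantic measures the authors use, how the two are combined, or what threshold value is chosen. The following is therefore only a minimal sketch of the general idea, not the paper's method. It assumes `difflib.SequenceMatcher` as a stand-in string measure, a small hypothetical synonym table (`SYNONYMS`) as a stand-in semantic resource, an arbitrary threshold of 0.85, and made-up example records. All function names are illustrative.

```python
from difflib import SequenceMatcher

# Hypothetical synonym groups standing in for a real semantic resource
# (an assumption for illustration; the abstract names no such resource).
SYNONYMS = [
    {"st", "street"},
    {"ave", "avenue"},
    {"cafe", "coffeehouse"},
]


def string_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (stand-in string measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_similarity(a: str, b: str) -> float:
    """Score 1.0 for identical or synonymous tokens, else string similarity."""
    if a == b or any(a in group and b in group for group in SYNONYMS):
        return 1.0
    return string_similarity(a, b)


def field_similarity(a: str, b: str) -> float:
    """Symmetric average of each token's best match in the other field."""
    tokens_a, tokens_b = a.lower().split(), b.lower().split()
    if not tokens_a or not tokens_b:
        return 0.0

    def directed(xs, ys):
        return sum(max(token_similarity(x, y) for y in ys) for x in xs) / len(xs)

    return (directed(tokens_a, tokens_b) + directed(tokens_b, tokens_a)) / 2


def record_similarity(r1: dict, r2: dict) -> float:
    """Mean field similarity over the fields both records share."""
    fields = r1.keys() & r2.keys()
    if not fields:
        return 0.0
    return sum(field_similarity(r1[f], r2[f]) for f in fields) / len(fields)


def is_duplicate(r1: dict, r2: dict, threshold: float = 0.85) -> bool:
    """Flag a record pair as duplicate when its similarity meets the threshold."""
    return record_similarity(r1, r2) >= threshold


def f_measure(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall from pair-level counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up records for illustration only (not taken from the Restaurant dataset).
r1 = {"name": "joe's cafe", "address": "12 main st"}
r2 = {"name": "joe's coffeehouse", "address": "12 main street"}
r3 = {"name": "blue dragon", "address": "400 ocean ave"}

print(is_duplicate(r1, r2))
print(is_duplicate(r1, r3))
```

In this sketch the pair `r1`/`r2` is matched only because the synonym table treats "cafe"/"coffeehouse" and "st"/"street" as equivalent; a purely character-based comparison of those tokens scores below 1.0, which is the kind of gap the abstract attributes to string-only approaches.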
Keywords
data mining; entity-relationship modelling; feature extraction; records management; string matching; F-measure; Restaurant dataset; beneficial information extraction; data integration process; duplicate detection; record linkage; semantic information; standard evaluation metrics; threshold-based method; threshold-based similarity measure; Biomedical measurements; Conferences; Couplings; Current measurement; Open systems; Semantics; Duplicate detection; Record linkage; Semantic similarity; String similarity;
fLanguage
English
Publisher
IEEE
Conference_Titel
Open Systems (ICOS), 2011 IEEE Conference on
Conference_Location
Langkawi
Print_ISBN
978-1-61284-931-7
Type
conf
DOI
10.1109/ICOS.2011.6079233
Filename
6079233