DocumentCode
1556632
Title
A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication
Author
Christen, Peter
Author_Institution
The Australian National University, Canberra
Volume
24
Issue
9
fYear
2012
Firstpage
1537
Lastpage
1555
Abstract
Record linkage is the process of matching records from several databases that refer to the same entities. When applied on a single database, this process is known as deduplication. Increasingly, matched data are becoming important in many application areas, because they can contain information that is not available otherwise, or that is too costly to acquire. Removing duplicate records in a single database is a crucial step in the data cleaning process, because duplicates can severely influence the outcomes of any subsequent data processing or data mining. With the increasing size of today's databases, the complexity of the matching process becomes one of the major challenges for record linkage and deduplication. In recent years, various indexing techniques have been developed for record linkage and deduplication. They are aimed at reducing the number of record pairs to be compared in the matching process by removing obvious nonmatching pairs, while at the same time maintaining high matching quality. This paper presents a survey of 12 variations of 6 indexing techniques. Their complexity is analyzed, and their performance and scalability is evaluated within an experimental framework using both synthetic and real data sets. No such detailed survey has so far been published.
Keywords
Complexity theory; Couplings; Encoding; Indexing; Data linkage; blocking; data matching; entity resolution; experimental evaluation; index techniques; scalability
fLanguage
English
Journal_Title
Knowledge and Data Engineering, IEEE Transactions on
Publisher
IEEE
ISSN
1041-4347
Type
jour
DOI
10.1109/TKDE.2011.127
Filename
5887335