DocumentCode
2005876
Title
Learning-Based Fusion for Data Deduplication
Author
Dinerstein, Jared ; Dinerstein, Sabra ; Egbert, Parris K. ; Clyde, Stephen W.
Author_Institution
Utah State Univ., Logan, UT, USA
fYear
2008
fDate
11-13 Dec. 2008
Firstpage
66
Lastpage
71
Abstract
Rule-based deduplication utilizes expert domain knowledge to identify and remove duplicate data records. Achieving high accuracy in a rule-based system requires the creation of rules containing a good combination of discriminatory clues. Unfortunately, accurate rule-based deduplication often requires significant manual tuning of both the rules and the corresponding thresholds. This need for manual tuning reduces the efficacy of rule-based deduplication and its applicability to real-world data sets. No adequate solution exists for this problem. We propose a novel technique for rule-based deduplication. We apply individual deduplication rules, and combine the resultant match scores via learning-based information fusion. We show empirically that our fused deduplication technique achieves higher average accuracy than traditional rule-based deduplication. Further, our technique alleviates the need for manual tuning of the deduplication rules and corresponding thresholds.
Keywords
database management systems; knowledge based systems; learning (artificial intelligence); sensor fusion; expert domain knowledge; learning-based information fusion; rule-based data deduplication; Atomic measurements; Computer errors; Data models; Databases; Knowledge based systems; Machine intelligence; Machine learning; Manuals; Support vector machines; XML; information fusion; supervised learning
fLanguage
English
Publisher
IEEE
Conference_Titel
Machine Learning and Applications, 2008. ICMLA '08. Seventh International Conference on
Conference_Location
San Diego, CA
Print_ISBN
978-0-7695-3495-4
Type
conf
DOI
10.1109/ICMLA.2008.83
Filename
4724957