Title : 
Learning-Based Fusion for Data Deduplication
         
        
            Author : 
Dinerstein, Jared ; Dinerstein, Sabra ; Egbert, Parris K. ; Clyde, Stephen W.
         
        
            Author_Institution : 
Utah State Univ., Logan, UT, USA
         
        
            Abstract : 
Rule-based deduplication utilizes expert domain knowledge to identify and remove duplicate data records. Achieving high accuracy in a rule-based system requires the creation of rules containing a good combination of discriminatory clues. Unfortunately, accurate rule-based deduplication often requires significant manual tuning of both the rules and the corresponding thresholds. This need for manual tuning reduces the efficacy of rule-based deduplication and its applicability to real-world data sets. No adequate solution exists for this problem. We propose a novel technique for rule-based deduplication. We apply individual deduplication rules, and combine the resultant match scores via learning-based information fusion. We show empirically that our fused deduplication technique achieves higher average accuracy than traditional rule-based deduplication. Further, our technique alleviates the need for manual tuning of the deduplication rules and corresponding thresholds.
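The fusion step described in the abstract (apply each rule separately, then learn how to combine the resulting match scores) can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the record fields, the three rules, the toy training pairs, and all function names are hypothetical, and a small hand-written logistic regression stands in for the learner. The keywords list support vector machines, but the abstract does not say which classifier the paper actually uses.

```python
import math
from difflib import SequenceMatcher

# Hypothetical deduplication rules: each compares two records and
# returns a match score in [0, 1]. None of these come from the paper.
def name_rule(a, b):
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()

def city_rule(a, b):
    return SequenceMatcher(None, a["city"].lower(), b["city"].lower()).ratio()

def zip_rule(a, b):
    return 1.0 if a["zip"] == b["zip"] else 0.0

RULES = [name_rule, city_rule, zip_rule]

def score_vector(a, b):
    """Apply every rule individually; the scores are the fusion inputs."""
    return [rule(a, b) for rule in RULES]

def predict_proba(weights, scores):
    """Fused duplicate probability; weights[-1] is the bias term."""
    z = sum(w * s for w, s in zip(weights, scores)) + weights[-1]
    return 1.0 / (1.0 + math.exp(-z))

def train_fusion(vectors, labels, epochs=2000, lr=0.5):
    """Fit logistic-regression weights with batch gradient descent.

    Stand-in learner (assumption): any supervised classifier over the
    score vectors could take its place.
    """
    weights = [0.0] * (len(vectors[0]) + 1)
    n = len(vectors)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(vectors, labels):
            err = predict_proba(weights, x) - y
            for i, xi in enumerate(x):
                grad[i] += err * xi
            grad[-1] += err
        weights = [w - lr * g / n for w, g in zip(weights, grad)]
    return weights

# Toy labeled pairs (fabricated for illustration only): 1 = duplicate.
r1 = {"name": "John Smith", "city": "Logan", "zip": "84321"}
r2 = {"name": "Jon Smith", "city": "Logan", "zip": "84321"}
r3 = {"name": "Mary Jones", "city": "San Diego", "zip": "92101"}
r4 = {"name": "Mary Jones", "city": "San Deigo", "zip": "92101"}
TRAIN = [(r1, r2, 1), (r3, r4, 1), (r1, r3, 0), (r2, r4, 0)]

weights = train_fusion(
    [score_vector(a, b) for a, b, _ in TRAIN],
    [label for _, _, label in TRAIN],
)

def is_duplicate(a, b):
    """Classify a pair using the learned fusion of rule scores."""
    return predict_proba(weights, score_vector(a, b)) > 0.5
```

In this sketch no per-rule threshold is set by hand: the learner weights the rule scores from labeled pairs, which is the kind of manual tuning the abstract says the technique alleviates.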
         
        
            Keywords : 
database management systems; knowledge based systems; learning (artificial intelligence); sensor fusion; expert domain knowledge; learning-based information fusion; rule-based data deduplication; Atomic measurements; Computer errors; Data models; Databases; Machine intelligence; Machine learning; Manuals; Support vector machines; XML; information fusion; supervised learning
         
        
            Conference_Title : 
Machine Learning and Applications, 2008. ICMLA '08. Seventh International Conference on
         
        
            Conference_Location : 
San Diego, CA
         
        
            Print_ISBN : 
978-0-7695-3495-4
         
        
            DOI : 
10.1109/ICMLA.2008.83