• DocumentCode
    2005876
  • Title
    Learning-Based Fusion for Data Deduplication
  • Author
    Dinerstein, Jared ; Dinerstein, Sabra ; Egbert, Parris K. ; Clyde, Stephen W.
  • Author_Institution
    Utah State Univ., Logan, UT, USA
  • fYear
    2008
  • fDate
    11-13 Dec. 2008
  • Firstpage
    66
  • Lastpage
    71
  • Abstract
    Rule-based deduplication utilizes expert domain knowledge to identify and remove duplicate data records. Achieving high accuracy in a rule-based system requires the creation of rules containing a good combination of discriminatory clues. Unfortunately, accurate rule-based deduplication often requires significant manual tuning of both the rules and the corresponding thresholds. This need for manual tuning reduces the efficacy of rule-based deduplication and its applicability to real-world data sets. No adequate solution exists for this problem. We propose a novel technique for rule-based deduplication. We apply individual deduplication rules, and combine the resultant match scores via learning-based information fusion. We show empirically that our fused deduplication technique achieves higher average accuracy than traditional rule-based deduplication. Further, our technique alleviates the need for manual tuning of the deduplication rules and corresponding thresholds.
  • Keywords
    database management systems; knowledge based systems; learning (artificial intelligence); sensor fusion; expert domain knowledge; learning-based information fusion; rule-based data deduplication; Atomic measurements; Computer errors; Data models; Databases; Knowledge based systems; Machine intelligence; Machine learning; Manuals; Support vector machines; XML; information fusion; supervised learning
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Machine Learning and Applications, 2008. ICMLA '08. Seventh International Conference on
  • Conference_Location
    San Diego, CA
  • Print_ISBN
    978-0-7695-3495-4
  • Type
    conf
  • DOI
    10.1109/ICMLA.2008.83
  • Filename
    4724957