• DocumentCode
    840508
  • Title
    Duplicate Record Detection: A Survey
  • Author
    Elmagarmid, Ahmed K. ; Ipeirotis, Panagiotis G. ; Verykios, Vassilios S.
  • Author_Institution
    Dept. of Comput. Sci. & Cyber Center, Purdue Univ., West Lafayette, IN
  • Volume
    19
  • Issue
    1
  • fYear
    2007
  • Firstpage
    1
  • Lastpage
    16
  • Abstract
    Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of transcription errors, incomplete information, lack of standard formats, or any combination of these factors. In this paper, we present a thorough analysis of the literature on duplicate record detection. We cover similarity metrics that are commonly used to detect similar field entries, and we present an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database. We also cover multiple techniques for improving the efficiency and scalability of approximate duplicate detection algorithms. We conclude with coverage of existing tools and with a brief discussion of the big open problems in the area.
  • Keywords
    data integrity; data mining; database management systems; database management system; duplicate detection algorithm; duplicate record detection; transcription error; Cleaning; Computer Society; Computer errors; Cost function; Couplings; Detection algorithms; Mirrors; Relational databases; Scalability; Uncertainty; Duplicate detection; data cleaning; data deduplication; data integration; database hardening; entity matching; entity resolution; fuzzy duplicate detection; identity uncertainty; instance identification; name matching; record linkage
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type
    jour
  • DOI
    10.1109/TKDE.2007.250581
  • Filename
    4016511