• DocumentCode
    3128479
  • Title

    Approximate Record Matching Using Hash Grams

  • Author

    Gollapalli, Mohammed ; Li, Xue ; Wood, Ian ; Governatori, Guido

  • Author_Institution
    Univ. of Queensland, Brisbane, QLD, Australia
  • fYear
    2011
  • fDate
    11-11 Dec. 2011
  • Firstpage
    504
  • Lastpage
    511
  • Abstract
    Accurately identifying duplicate records across multiple data sources is a persistent problem that continues to plague organizations and researchers alike. Small inconsistencies between records can prevent two otherwise identical records from being detected as duplicates. In this paper, we present a new probabilistic h-gram (hash gram) record matching technique that extends traditional n-grams and uses scale-based hashing for equality testing. By transforming data into its equivalent numerical realities, h-gram matching greatly reduces the number of comparisons needed for duplicate record detection and is applicable to a variety of data types and data sizes. A key feature of h-gram matching is that it is highly extensible, providing more intuitive and flexible results. With the sampling technique in place, our method can be applied to databases of variable size to perform data linkage, and probabilistic results can be obtained quickly. We have extensively evaluated h-gram matching on large samples of real-world data, and the results show a higher level of accuracy as well as a reduction in the required time when compared with existing techniques.
  • Keywords
    data handling; pattern matching; probability; records management; Hash grams; data source; equality testing; numerical realities; plague organizations; probabilistic h-gram; record duplication; record matching approximation; Accuracy; Australia; Couplings; Databases; Educational institutions; Probabilistic logic; Servers; Approximate Matching; Data Linkage; Record Matching; Structure Matching;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining Workshops (ICDMW), 2011 IEEE 11th International Conference on
  • Conference_Location
    Vancouver, BC
  • Print_ISBN
    978-1-4673-0005-6
  • Type

    conf

  • DOI
    10.1109/ICDMW.2011.33
  • Filename
    6137421