• DocumentCode
    3048230
  • Title
    Automated selection of blocking columns for record linkage
  • Author
    Prasad, K. Hima ; Chaturvedi, Snigdha ; Faruquie, Tanveer A. ; Subramaniam, L. Venkata ; Mohania, Mukesh K.
  • Author_Institution
    IBM Res. India, New Delhi, India
  • fYear
    2012
  • fDate
    8-10 July 2012
  • Firstpage
    78
  • Lastpage
    83
  • Abstract
    Record linkage is an essential but expensive step in enterprise data management. In most deployments, blocking techniques are employed to reduce the number of record-pair comparisons and hence the computational complexity of the task. Blocking algorithms require a careful selection of the column(s) to be used for blocking. Selection of an appropriate blocking column is critical to the accuracy and speed-up offered by the blocking technique and requires intervention by data quality practitioners, who can exploit prior domain knowledge to analyse a small sample of the huge database and decide on the blocking column(s). However, the selection of optimal blocking column(s) can depend heavily on the quality of the data and requires extensive analysis by an experienced data quality practitioner. In this paper, we present a data-driven approach to automatically choose blocking column(s), motivated by the modus operandi of data quality practitioners. Our approach produces a ranked list of columns by evaluating their appropriateness for blocking on the basis of factors including data quality and distribution. We evaluate our choice of blocking columns through experiments on real-world and synthetic datasets. We also extend our approach to scenarios where more than one column can be used for blocking.
  • Keywords
    data handling; records management; blocking columns automated selection; blocking techniques; data distribution; data driven approach; data quality practitioners; domain knowledge; enterprise data management; real world datasets; record linkage; record pair comparison reduction; synthetic datasets; task computational complexity; Couplings; Gold; Niobium; Standards;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Service Operations and Logistics, and Informatics (SOLI), 2012 IEEE International Conference on
  • Conference_Location
    Suzhou
  • Print_ISBN
    978-1-4673-2400-7
  • Type
    conf
  • DOI
    10.1109/SOLI.2012.6273508
  • Filename
    6273508