• DocumentCode
    3104732
  • Title
    Adaptive Blocking: Learning to Scale Up Record Linkage
  • Author
    Bilenko, Mikhail; Kamath, Beena; Mooney, Raymond J.
  • Author_Institution
    Microsoft Research, Redmond, WA
  • fYear
    2006
  • fDate
    18-22 Dec. 2006
  • Firstpage
    87
  • Lastpage
    96
  • Abstract
    Many data mining tasks require computing similarity between pairs of objects. Pairwise similarity computations are particularly important in record linkage systems, as well as in clustering and schema mapping algorithms. Because the number of object pairs grows quadratically with the size of the dataset, computing similarity between all pairs is impractical and becomes prohibitive for large datasets and complex similarity functions. Blocking methods alleviate this problem by efficiently selecting approximately similar object pairs for subsequent distance computations, leaving out the remaining pairs as dissimilar. Previously proposed blocking methods require manually constructing an index-based similarity function or selecting a set of predicates, followed by hand-tuning of parameters. In this paper, we introduce an adaptive framework for automatically learning blocking functions that are efficient and accurate. We describe two predicate-based formulations of learnable blocking functions and provide learning algorithms for training them. The effectiveness of the proposed techniques is demonstrated on real and simulated datasets, on which they prove to be more accurate than non-adaptive blocking methods.
  • Keywords
    data mining; learning (artificial intelligence); adaptive blocking; learning algorithms; nonadaptive blocking methods; pairwise similarity computations; record linkage systems; schema mapping algorithms; subsequent distance computations; Clustering algorithms; Computational modeling; Couplings; Indexing; Machine learning; Machine learning algorithms; Sorting; Sparse matrices; Uncertainty
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining, 2006. ICDM '06. Sixth International Conference on
  • Conference_Location
    Hong Kong
  • ISSN
    1550-4786
  • Print_ISBN
    0-7695-2701-7
  • Type
    conf
  • DOI
    10.1109/ICDM.2006.13
  • Filename
    4053037