• DocumentCode
    28372
  • Title
    A Blocking Framework for Entity Resolution in Highly Heterogeneous Information Spaces
  • Author
    Papadakis, George; Ioannou, Ekaterini; Palpanas, T.; Niederee, Claudia; Nejdl, Wolfgang
  • Author_Institution
    L3S Res. Center, Leibniz Univ. of Hanover, Hanover, Germany
  • Volume
    25
  • Issue
    12
  • fYear
    2013
  • fDate
    Dec. 2013
  • Firstpage
    2665
  • Lastpage
    2682
  • Abstract
    In the context of entity resolution (ER) in highly heterogeneous, noisy, user-generated entity collections, practically all block building methods employ redundancy to achieve high effectiveness. This practice, however, results in a high number of pairwise comparisons, with a negative impact on efficiency. Existing block processing strategies aim at discarding unnecessary comparisons at no cost in effectiveness. In this paper, we systemize blocking methods for clean-clean ER (an inherently quadratic task) over highly heterogeneous information spaces (HHIS) through a novel framework that consists of two orthogonal layers: the effectiveness layer encompasses methods for building overlapping blocks with small likelihood of missed matches; the efficiency layer comprises a rich variety of techniques that significantly restrict the required number of pairwise comparisons, having a controllable impact on the number of detected duplicates. We map to our framework all relevant existing methods for creating and processing blocks in the context of HHIS, and additionally propose two novel techniques: attribute clustering blocking and comparison scheduling. We evaluate the performance of each layer and method on two large-scale, real-world data sets and validate the excellent balance between efficiency and effectiveness that they achieve.
  • Keywords
    pattern clustering; HHIS; attribute clustering blocking; block creation; block processing; blocking framework; clean-clean ER; comparison scheduling; duplicate detection; effectiveness layer; efficiency layer; entity resolution; highly heterogeneous information spaces; highly heterogeneous-noisy-user-generated entity collections; large-scale real-world data sets; layer performance evaluation; missed match likelihood; orthogonal layers; overlapping blocks; pairwise comparisons; quadratic task; Blocking methods; Context awareness; Data mining; Information retrieval; Redundancy; Information integration
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type
    jour
  • DOI
    10.1109/TKDE.2012.150
  • Filename
    6255742