• DocumentCode
    570227
  • Title
    A framework for entity resolution with efficient blocking
  • Author
    Shu, Liangcai ; Lin, Can ; Meng, Weiyi ; Han, Yue ; Yu, Clement T. ; Smalheiser, Neil R.
  • Author_Institution
    Dept. of Comput. Sci., State Univ. of New York at Binghamton, Binghamton, NY, USA
  • fYear
    2012
  • fDate
    8-10 Aug. 2012
  • Firstpage
    431
  • Lastpage
    440
  • Abstract
    In applications of Web data integration, we frequently need to identify whether data objects in different data sources represent the same real-world entity. This problem is known as entity resolution. In this paper, we propose a generic framework for entity resolution on relational data sets, called BARM, consisting of the Blocker, Attribute matchers, and the Record Matcher. BARM is designed so that different blocking and matching algorithms can be plugged into it. For the blocker, we apply SPectrAl Neighborhood (SPAN), a state-of-the-art blocking algorithm, to our data sets and show that SPAN is effective and efficient. For attribute matchers, we propose the Context Sensitive Value Matching Library (CSVML) for matching attribute values, together with an approach to evaluate the goodness of matching functions. CSVML takes the meaning and context of attribute values into consideration and therefore performs well, as shown in experimental results. We adopt a Bayesian network as the record matcher in the framework and propose a method of inference from the Bayesian network based on the Markov blanket of the network. For comparison, we also apply three other classifiers, namely Decision Tree, Support Vector Machines, and the Naive Bayes classifier, to our data sets. Experiments show that the Bayesian network is advantageous in the book domain.
  • Keywords
    Bayes methods; Internet; Markov processes; data integration; pattern matching; relational databases; BARM; Bayesian network; CSVML; Markov blanket; Web data integration; attribute matcher; attribute value matching; blocker; context sensitive value matching library; entity resolution; inference method; record matcher; relational data set; spectral neighborhood; Bayesian methods; Books; Context; Databases; Erbium; Sparse matrices; Vectors;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Information Reuse and Integration (IRI), 2012 IEEE 13th International Conference on
  • Conference_Location
    Las Vegas, NV
  • Print_ISBN
    978-1-4673-2282-9
  • Electronic_ISBN
    978-1-4673-2283-6
  • Type
    conf
  • DOI
    10.1109/IRI.2012.6303041
  • Filename
    6303041