DocumentCode
570227
Title
A framework for entity resolution with efficient blocking
Author
Shu, Liangcai ; Lin, Can ; Meng, Weiyi ; Han, Yue ; Yu, Clement T. ; Smalheiser, Neil R.
Author_Institution
Dept. of Comput. Sci., State Univ. of New York at Binghamton, Binghamton, NY, USA
fYear
2012
fDate
8-10 Aug. 2012
Firstpage
431
Lastpage
440
Abstract
In Web data integration applications, we frequently need to determine whether data objects in different data sources represent the same real-world entity. This problem is known as entity resolution. In this paper, we propose a generic framework for entity resolution on relational data sets, called BARM, consisting of the Blocker, Attribute matchers, and the Record Matcher. BARM is designed so that different blocking and matching algorithms can be plugged into it. For the blocker, we apply SPectrAl Neighborhood (SPAN), a state-of-the-art blocking algorithm, to our data sets and show that SPAN is effective and efficient. For the attribute matchers, we propose the Context Sensitive Value Matching Library (CSVML) for matching attribute values, along with an approach to evaluate the goodness of matching functions. CSVML takes the meaning and context of attribute values into consideration and therefore performs well, as shown in experimental results. We adopt a Bayesian network as the record matcher in the framework and propose a method of inference from the Bayesian network based on the Markov blanket of the network. For comparison, we also apply three other classifiers, namely Decision Tree, Support Vector Machines, and the Naive Bayes classifier, to our data sets. Experiments show that the Bayesian network is advantageous in the book domain.
Keywords
Bayes methods; Internet; Markov processes; data integration; pattern matching; relational databases; BARM; Bayesian network; CSVML; Markov blanket; Web data integration; attribute matcher; attribute value matching; blocker; context sensitive value matching library; entity resolution; inference method; record matcher; relational data set; spectral neighborhood; Bayesian methods; Books; Context; Databases; Erbium; Sparse matrices; Vectors
fLanguage
English
Publisher
IEEE
Conference_Titel
Information Reuse and Integration (IRI), 2012 IEEE 13th International Conference on
Conference_Location
Las Vegas, NV
Print_ISBN
978-1-4673-2282-9
Electronic_ISBN
978-1-4673-2283-6
Type
conf
DOI
10.1109/IRI.2012.6303041
Filename
6303041