DocumentCode :
751861
Title :
Record Matching over Query Results from Multiple Web Databases
Author :
Su, Weifeng ; Wang, Jiying ; Lochovsky, Frederick H.
Author_Institution :
Comput. Sci. & Technol. Program, BNU-HKBU United Int. Coll., Zhuhai, China
Volume :
22
Issue :
4
fYear :
2010
fDate :
4/1/2010 12:00:00 AM
Firstpage :
578
Lastpage :
589
Abstract :
Record matching, which identifies the records that represent the same real-world entity, is an important step for data integration. Most state-of-the-art record matching methods are supervised, which requires the user to provide training data. These methods are not applicable for the Web database scenario, where the records to match are query results dynamically generated on-the-fly. Such records are query-dependent and a prelearned method using training examples from previous query results may fail on the results of a new query. To address the problem of record matching in the Web database scenario, we present an unsupervised, online record matching method, UDD, which, for a given query, can effectively identify duplicates from the query result records of multiple Web databases. After removal of the same-source duplicates, the "presumed" nonduplicate records from the same source can be used as training examples, alleviating the burden of users having to manually label training examples. Starting from the nonduplicate set, we use two cooperating classifiers, a weighted component similarity summing classifier and an SVM classifier, to iteratively identify duplicates in the query results from multiple Web databases. Experimental results show that UDD works well for the Web database scenario where existing supervised methods do not apply.
Keywords :
information retrieval systems; pattern classification; pattern matching; query processing; support vector machines; SVM classifier; cooperating classifiers; data integration; multiple web databases; online record matching method; query results; same-source duplicates; training data; Record matching; SVM; Web database; data deduplication; duplicate detection; query result record; record linkage
fLanguage :
English
Journal_Title :
Knowledge and Data Engineering, IEEE Transactions on
Publisher :
IEEE
ISSN :
1041-4347
Type :
jour
DOI :
10.1109/TKDE.2009.90
Filename :
4840347