DocumentCode
740536
Title
Principled Graph Matching Algorithms for Integrating Multiple Data Sources
Author
Zhang, Duo ; Rubinstein, Benjamin I. P. ; Gemmell, Jim
Author_Institution
Twitter, 1355 Market Street, Suite 900, San Francisco, CA, USA
Volume
27
Issue
10
fYear
2015
Firstpage
2784
Lastpage
2796
Abstract
This paper explores combinatorial optimization for problems of max-weight graph matching on multi-partite graphs, which arise in integrating multiple data sources. In the most common two-source case, it is often desirable for the final matching to be one-to-one; the database and statistical record linkage communities accomplish this by weighted bipartite graph matching on similarity scores. Such matchings are intuitively appealing: they leverage a natural global property of many real-world entity stores—that of being nearly deduped—and are known to provide significant improvements to precision and recall. Unfortunately, unlike the bipartite case, exact max-weight matching on multi-partite graphs is known to be NP-hard. Our two-fold algorithmic contributions approximate multi-partite max-weight matching: our first algorithm borrows optimization techniques common to Bayesian probabilistic inference; our second is a greedy approximation algorithm. In addition to a theoretical guarantee on the latter, we present comparisons on a real-world entity resolution problem from Bing significantly larger than typically found in the literature, on publication data, and on a series of synthetic problems. Our results quantify significant improvements due to exploiting multiple sources, which are made possible by global one-to-one constraints linking otherwise independent matching sub-problems. We also discover that our algorithms are complementary: one being much more robust under noise, and the other being simple to implement and very fast to run.
Keywords
Approximation methods; Couplings; Databases; Erbium; Inference algorithms; Motion pictures; Optimization; Data integration; message-passing algorithms; weighted graph matching;
fLanguage
English
Journal_Title
Knowledge and Data Engineering, IEEE Transactions on
Publisher
ieee
ISSN
1041-4347
Type
jour
DOI
10.1109/TKDE.2015.2426714
Filename
7095602
Link To Document