Title :
Probabilistic estimates of attribute statistics and match likelihood for people entity resolution
Author :
Xin Wang ; Ang Sun ; Kardes, Hakan ; Agrawal, Sanjay ; Lin Chen ; Borthwick, Andrew
Author_Institution :
Data Res., Intelius Inc., Bellevue, WA, USA
Abstract :
For big data practitioners, data integration/entity resolution/record linkage is one of the key challenges we face from day to day. Entity resolution/record linkage with high precision and recall on a large graph with billions of nodes and hundreds of times more edges poses significant scalability challenges. Similarity-based graph partitioning is still the most scalable method available. This paper presents a probabilistic method to approximate the match likelihood of a pair of records by incorporating values of different attributes and their aggregates/statistics. The quality of the approximations depends on the accuracy of the estimates of the aggregated values. The paper adapts the GTM model described in [1] to obtain the estimates. We present experimental results based on real-world commercial data sources to show that the estimates obtained via the GTM model are better than the baseline. Our experimental results also show that the approximate match likelihood can improve the recall of the similarity function.
Keywords :
Big Data; data integration; graph theory; probability; GTM model; attribute statistics; big data practitioners; commercial data sources; data integration; entity resolution; match likelihood; people entity resolution; probabilistic estimates; record linkage; scalability challenges; scalable method; similarity based graph partition; Adaptation models; Cities and towns; Clustering algorithms; Couplings; Frequency estimation; Sociology; Approximate Probabilistic Estimates; Big Data Demographic Information; Data Fusion; Data Integration; Entity Resolution; Record Linkage;
Conference_Title :
Big Data (Big Data), 2014 IEEE International Conference on
Conference_Location :
Washington, DC
DOI :
10.1109/BigData.2014.7004459