DocumentCode :
2082055
Title :
Effective automated Object Matching
Author :
Zardetto, Diego ; Scannapieco, Monica ; Catarci, Tiziana
Author_Institution :
Ist. Naz. di Statistica, Rome, Italy
fYear :
2010
fDate :
1-6 March 2010
Firstpage :
757
Lastpage :
768
Abstract :
Object Matching (OM) is the problem of identifying pairs of data-objects coming from different sources and representing the same real-world object. Several methods have been proposed to solve OM problems, but none of them seems to be at the same time fully automated and very effective. In this paper we present a fundamentally new suite of methods that instead possesses both these abilities. We adopt a statistical approach based on mixture models, which structures an OM process into two consecutive tasks. First, mixture parameters are estimated by fitting the model to observed distance measures between pairs. Then, a probabilistic clustering of the pairs into Matches and Unmatches is obtained by exploiting the fitted model. In particular, we use a mixture model with component densities belonging to the Beta parametric family and we fit it by means of an original perturbation-like technique. Moreover, we solve the clustering problem according to both Maximum Likelihood and Minimum Cost objectives. To accomplish this task, optimal decision rules fulfilling one-to-one matching constraints are searched by a purposefully designed evolutionary algorithm. Notably, our suite of methods is distance-independent in the sense that it does not rely on any restrictive assumption on the function to be used when comparing data-objects. Even more interestingly, our approach is not confined to record linkage applications but can also be applied to match other kinds of data-objects. We present several experiments on real data that validate the proposed methods and show their excellent effectiveness.
Keywords :
evolutionary computation; maximum likelihood estimation; pattern clustering; pattern matching; perturbation techniques; probability; Beta parametric family; automated object matching; clustering problem; component densities; evolutionary algorithm; maximum likelihood; minimum cost objectives; mixture models; one-to-one matching constraints; optimal decision rules; perturbation like technique; probabilistic clustering; Algorithm design and analysis; Automation; Constraint optimization; Costs; Couplings; Design optimization; Evolutionary computation; Maximum likelihood estimation; Parameter estimation; Probability distribution;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Engineering (ICDE), 2010 IEEE 26th International Conference on
Conference_Location :
Long Beach, CA
Print_ISBN :
978-1-4244-5445-7
Electronic_ISBN :
978-1-4244-5444-0
Type :
conf
DOI :
10.1109/ICDE.2010.5447904
Filename :
5447904