DocumentCode :
2079352
Title :
ProbClean: A probabilistic duplicate detection system
Author :
Beskales, George ; Soliman, Mohamed A. ; Ilyas, Ihab F. ; Ben-David, Shai ; Kim, Yubin
Author_Institution :
Sch. of Comput. Sci., Univ. of Waterloo, Waterloo, ON, Canada
fYear :
2010
fDate :
1-6 March 2010
Firstpage :
1193
Lastpage :
1196
Abstract :
One of the most prominent data quality problems is the existence of duplicate records. Current data cleaning systems usually produce one clean instance (repair) of the input data, by carefully choosing the parameters of the duplicate detection algorithms. Finding the right parameter settings can be hard, and in many cases, perfect settings do not exist. We propose ProbClean, a system that treats duplicate detection procedures as data processing tasks with uncertain outcomes. We use a novel uncertainty model that compactly encodes the space of possible repairs corresponding to different parameter settings. ProbClean efficiently supports relational queries and allows new types of queries against a set of possible repairs.
Keywords :
data integrity; ProbClean; data cleaning systems; data quality problems; probabilistic duplicate detection system; Business; Cleaning; Computer science; Data mining; Data processing; Data warehouses; Detection algorithms; Query processing; Relational databases; Uncertainty;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Data Engineering (ICDE), 2010 IEEE 26th International Conference on
Conference_Location :
Long Beach, CA
Print_ISBN :
978-1-4244-5445-7
Electronic_ISBN :
978-1-4244-5444-0
Type :
conf
DOI :
10.1109/ICDE.2010.5447744
Filename :
5447744