Title :
Eliminating Duplicates in Information Integration: An Adaptive, Extensible Framework
Author :
Shahri, Hamid Haidarian ; Shahri, Saied Haidarian
Author_Institution :
Dept. of Comput. Sci., Maryland Univ., College Park, MD
Abstract :
Data cleaning is an inevitable problem when integrating data from distributed operational databases, because no unified set of standards spans all the distributed sources. One of the most challenging phases of data cleaning is removing fuzzy duplicate records. Approximate or fuzzy duplicates pertain to two or more tuples that describe the same real-world entity using different syntaxes. In other words, they have the same semantics but different syntaxes. Eliminating fuzzy duplicates is applicable in any database but is critical in data-integration and analytical-processing domains, which involve data warehouses, data mining applications, and decision support systems. Earlier approaches, which required hard coding rules based on a schema, were time consuming and tedious, and you couldn't later adapt the rules. We propose a novel duplicate-elimination framework which exploits fuzzy inference and includes unique machine learning capabilities to let users clean their data flexibly and effortlessly without requiring any coding.
Keywords :
data integrity; data mining; database management systems; fuzzy reasoning; learning (artificial intelligence); analytical-processing domains; data cleaning; data mining applications; data warehouses; data-integration; decision support systems; distributed operational databases; duplicate-elimination framework; fuzzy duplicate records; fuzzy inference; information integration; machine learning capabilities; Cleaning; Clustering algorithms; Data analysis; Distributed databases; Fuzzy systems; Humans; Inference algorithms; Libraries; data warehouse and repository; database applications; fuzzy and probabilistic reasoning; knowledge management applications; uncertainty
Journal_Title :
Intelligent Systems, IEEE
DOI :
10.1109/MIS.2006.90