DocumentCode
2296171
Title
Data Deduplication Techniques and Analysis
Author
Maddodi, Srivatsa ; Attigeri, G.V. ; Karunakar, A.K.
Author_Institution
Dept. of Inf. & Commun. Technol., Manipal Inst. of Technol., Manipal, India
fYear
2010
fDate
19-21 Nov. 2010
Firstpage
664
Lastpage
668
Abstract
Data warehouses are repositories of data collected from several data sources, and they form the backbone of most decision support applications. As the data sources are independent, they may adopt independent and potentially inconsistent conventions. In data warehousing applications during ETL (Extraction, Transformation and Loading), or even in OLTP (On Line Transaction Processing) applications, we often encounter duplicate records in a table. Moreover, data entry mistakes at any of these sources introduce more errors. Since high-quality data is essential for gaining the confidence of users of decision support applications, ensuring high data quality is critical to the success of data warehouse implementations. Therefore, a significant amount of time and money is spent on detecting and correcting errors and inconsistencies. The process of cleaning dirty data is often referred to as data cleaning. To make the table data consistent and accurate, we need to remove these duplicate records from the table. In this paper we discuss different deduplication strategies along with their pros and cons, and some of the methods used to prevent duplication in a database. In addition, we have carried out a performance evaluation with Microsoft SQL-Server 2008 on the Food Mart and AdventureDB warehouses.
Keywords
SQL; data mining; data warehouses; decision support systems; transaction processing; Microsoft SQL-Server 2008; OLTP; data deduplication; data extraction; data loading; data sources; data transformation; decision support applications; on line transaction processing; Data Cleaning; Deduplication; ETL
fLanguage
English
Publisher
IEEE
Conference_Titel
Emerging Trends in Engineering and Technology (ICETET), 2010 3rd International Conference on
Conference_Location
Goa
ISSN
2157-0477
Print_ISBN
978-1-4244-8481-2
Electronic_ISBN
2157-0477
Type
conf
DOI
10.1109/ICETET.2010.42
Filename
5698409