Title :
A technique for the quantitative measure of data cleanliness
Author :
Wakchaure, Abhijit ; Eaglin, Ronald ; Motlagh, Bahman
Author_Institution :
Sch. of Electr. Eng. & Comput. Sci., Univ. of Central Florida, Orlando, FL
Abstract :
With the amount of data collected, viewed, processed, and stored today, techniques for analyzing the accuracy of data are extremely important. Since we cannot improve what we cannot measure, a tangible quantitative measure of data quality is a necessity. This paper focuses on a data-cleanliness algorithm, which makes use of the 'Levenshtein distance', to measure the data quality of a criminal records database. Actual law enforcement name records were used for this research. The results indicate the extent of dirtiness in the data and also highlight the different types of dirty data. We then show how measuring data quality not only helps in setting up guidelines for the data clean-up process, but can also serve as a metric for cross-comparing similar databases.
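The record does not include the paper's exact algorithm; as background, the Levenshtein distance it builds on can be sketched with the standard dynamic-programming recurrence. The function name and the example name strings below are illustrative assumptions, not drawn from the paper:

```python
def levenshtein(a: str, b: str) -> int:
    """Compute the Levenshtein (edit) distance between two strings:
    the minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    # prev[j] holds the distance between a[:i-1] and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i] + [0] * len(b)
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr[j] = min(prev[j] + 1,         # deletion from a
                          curr[j - 1] + 1,     # insertion into a
                          prev[j - 1] + cost)  # substitution (or match)
        prev = curr
    return prev[len(b)]

# Hypothetical name-record comparison: substitute I->Y, insert E
print(levenshtein("SMITH", "SMYTHE"))  # → 2
```

A small distance between two name records suggests they may be variant spellings of the same entity, which is one plausible way such a measure feeds into a cleanliness score.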
Keywords :
data analysis; data mining; Levenshtein distance; criminal records database; data accuracy; data cleanliness; data quality; dirty data; computer science; costs; data engineering; data warehouses; databases; guidelines; law enforcement;
Conference_Title :
2008 IEEE Conference on Cybernetics and Intelligent Systems
Conference_Location :
Chengdu
Print_ISBN :
978-1-4244-1673-8
Electronic_ISBN :
978-1-4244-1674-5
DOI :
10.1109/ICCIS.2008.4670930