• DocumentCode
    3352144
  • Title

    A technique for the quantitative measure of data cleanliness

  • Author

    Wakchaure, Abhijit ; Eaglin, Ronald ; Motlagh, Bahman

  • Author_Institution
    Sch. of Electr. Eng. & Comput. Sci., Univ. of Central Florida, Orlando, FL
  • fYear
    2008
  • fDate
    21-24 Sept. 2008
  • Firstpage
    1258
  • Lastpage
    1263
  • Abstract
    With the amount of data that is collected, viewed, processed, and stored today, techniques for analyzing the accuracy of data are extremely important. Since we cannot improve what we cannot measure, a tangible quantitative measure of data quality is a necessity. This paper focuses on a data-cleanliness algorithm, which makes use of the 'Levenshtein distance', to measure the data quality of a criminal records database. Actual law enforcement name records were used for this research. The results help us determine the extent of dirtiness in the data and highlight the different types of dirty data. We then show how measuring data quality not only helps in setting up guidelines for the data clean-up process, but can also be used as a metric for cross-comparing like databases.
  • Keywords
    data analysis; data mining; Levenshtein distance; criminal records database; data accuracy; data cleanliness; data quality; Bismuth; Computer science; Costs; Data engineering; Data warehouses; Databases; Electric variables measurement; Guidelines; Law enforcement; dirty data
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cybernetics and Intelligent Systems, 2008 IEEE Conference on
  • Conference_Location
    Chengdu
  • Print_ISBN
    978-1-4244-1673-8
  • Electronic_ISBN
    978-1-4244-1674-5
  • Type

    conf

  • DOI
    10.1109/ICCIS.2008.4670930
  • Filename
    4670930
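
The abstract describes measuring data cleanliness with the Levenshtein distance but does not give the algorithm itself. The following is only a minimal sketch of the general idea, not the authors' method: `levenshtein` is the standard dynamic-programming edit distance, while `suspect_fraction`, its `max_distance` threshold, the upper-casing normalization, and the sample names are all assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard edit distance: minimum insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def suspect_fraction(names, max_distance=1):
    """Hypothetical dirtiness score (an assumption, not from the paper):
    the share of distinct names lying within max_distance edits of
    another distinct name, i.e. likely misspellings of each other."""
    distinct = sorted({n.strip().upper() for n in names if n.strip()})
    if not distinct:
        return 0.0
    suspects = set()
    for i, x in enumerate(distinct):
        for y in distinct[i + 1:]:
            if levenshtein(x, y) <= max_distance:
                suspects.update((x, y))
    return len(suspects) / len(distinct)


if __name__ == "__main__":
    # Made-up sample names, not data from the paper.
    sample = ["SMITH", "SMYTH", "JONES", "smith "]
    print(levenshtein("kitten", "sitting"))
    print(suspect_fraction(sample))
```

The pairwise comparison is quadratic in the number of distinct names, so a real records database would likely need blocking or indexing before a score like this could be computed at scale.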