• DocumentCode
    3048281
  • Title

    Managing data quality by identifying the noisiest data samples

  • Author

    Prasad, K. Hima ; Chaturvedi, Snigdha ; Faruquie, Tanveer A. ; Subramaniam, L. Venkata ; Mohania, Mukesh K.

  • Author_Institution
    IBM Res. India, New Delhi, India
  • fYear
    2012
  • fDate
    8-10 July 2012
  • Firstpage
    90
  • Lastpage
    95
  • Abstract
    Enterprise datasets are often noisy. Several columns can have non-standard, erroneous or missing information. Poor-quality data can lead to incorrect reporting and wrong conclusions being drawn. Data cleansing involves standardizing such data to improve its quality. Often, data cleansing tasks involve writing rules manually. This step involves understanding the data quality issues and then writing data transformation rules to correct them. This is a human-intensive task. In this study, we propose a method to identify noisy subsets of huge unlabelled textual datasets. This is a two-step process; in the first step, we develop an estimation tool to predict the data quality on an unlabelled text dataset as produced by a segmentation model. The accuracy of the proposed method is shown on a real-life dataset.
  • Keywords
    business data processing; text analysis; data cleansing; data quality management; enterprise datasets; estimation tool; noisiest data sample identification; segmentation model; unlabelled text dataset; writing data transformation rules; Atmospheric measurements; Cities and towns; Entropy; Feature extraction; Lead; Particle measurements; Roads;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Service Operations and Logistics, and Informatics (SOLI), 2012 IEEE International Conference on
  • Conference_Location
    Suzhou
  • Print_ISBN
    978-1-4673-2400-7
  • Type

    conf

  • DOI
    10.1109/SOLI.2012.6273510
  • Filename
    6273510