Title :
Optimal Training Data Selection for Rule-Based Data Cleansing Models
Author :
Chaturvedi, Snigdha ; Faruquie, Tanveer A. ; Subramaniam, L. Venkata ; Prasad, K. Hima ; Venkatachaliah, Girish ; Padmanabhan, Sriram
Author_Institution :
IBM Res. - India, New Delhi, India
Date :
March 29 2011-April 2 2011
Abstract :
Enterprises today accumulate huge quantities of data, which is often noisy and unstructured, making data cleansing an important task. Data cleansing refers to standardizing data from different sources to a common format so that the data can be better utilized. Most enterprise data cleansing models are rule based and involve a lot of manual effort. Writing data quality rules is a tedious task and often results in erroneous rules because of the ambiguities that the data presents. A robust data cleansing model should be capable of handling a wide variety of records, and this capability often depends on the choice of sample records the knowledge engineer uses to write the rules. In this paper we present a method to select a diverse set of data records which, when used to create the rule-based data cleansing model, can cover the maximum number of records. We also present a similarity metric between two records which helps in choosing the diverse set of data samples. We also present a crowd sourcing based labeling mechanism to label the diverse records selected by the system, so that the collective intelligence of the crowd can be used to eliminate the errors that occur in labeling sample data. We also present a method to select a difficult set of diverse examples so that the crowd and the rule writer services can be effectively utilized to create a better cleansing model. We also present a method for selecting such records for updating an existing rule set. We present experimental results to show the effectiveness of the proposed methods. Results demonstrate an increase of 12% in the number of rules written using this procedure. We also show that the method identifies records on which the existing model yields lower accuracy than on the records identified by other techniques, and thus identifies records that are more difficult for the existing model to cleanse.
Keywords :
data handling; database management systems; collective intelligence; crowd sourcing based labeling mechanism; data presentation; data records; knowledge engineer; nature making data cleansing; optimal training data selection; rule based data cleansing models; rule writer services; Data models; Feature extraction; Knowledge based systems; Knowledge engineering; Labeling; Training; Writing; Data Cleansing; Data Selection; Rule-based System;
Conference_Title :
SRII Global Conference (SRII), 2011 Annual
Conference_Location :
San Jose, CA
Print_ISBN :
978-1-61284-415-2
Electronic_ISBN :
978-0-7695-4371-0
DOI :
10.1109/SRII.2011.25