Title :
E-Clean: A Data Cleaning Framework for Patient Data
Author :
Mohamed, Hasimah Hj ; Kheng, Tee Leong ; Collin, Chee ; Lee, Ong Siong
Author_Institution :
Sch. of Comput. Sci., Univ. Sains Malaysia, Pulau, Malaysia
Abstract :
Quality data must be prepared by pre-processing the raw data. Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve its quality. Data cleaning systems are needed to support any changes in the structure, representation or content of data. The cleaning process has three parts: extracting invalid values, matching attributes with valid values, and applying a data cleaning algorithm. Our system uses the extract, transform and load (ETL) model as its main process model, which serves as a guideline for the implementation. In addition, parsing techniques are used to identify dirty data. The method we chose for matching attributes is the regular expression. Among the available data cleaning algorithms, the k-Nearest Neighbor algorithm was selected for the data cleaning part of this project because it is simple to understand and easy to implement.
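The three-part process described in the abstract (extract invalid values, match attributes with regular expressions, repair with k-Nearest Neighbor) could be sketched roughly as follows. This is only an illustrative sketch, not the E-Clean implementation: the field names, the patterns in `PATTERNS`, the helper names `find_invalid`, `knn_fill` and `clean`, and the choice of Euclidean distance with majority voting are all assumptions that do not come from the paper.

```python
import re
from collections import Counter

# Hypothetical validation rules; the paper does not list its actual patterns.
PATTERNS = {
    "age": re.compile(r"\d{1,3}"),
    "weight": re.compile(r"\d{1,3}(\.\d+)?"),
    "gender": re.compile(r"[MF]"),
}
# Assumed numeric attributes used to measure similarity between patients.
NUMERIC_FEATURES = ["age", "weight"]


def find_invalid(record):
    """Return the fields whose values do not fully match their pattern."""
    return [
        field
        for field, pattern in PATTERNS.items()
        if not pattern.fullmatch(str(record.get(field, "")))
    ]


def knn_fill(record, field, clean_records, features, k=3):
    """Pick the majority value of `field` among the k nearest clean records."""

    def distance(other):
        # Euclidean distance over the numeric features that are still usable.
        return sum(
            (float(record[f]) - float(other[f])) ** 2 for f in features
        ) ** 0.5

    neighbours = sorted(clean_records, key=distance)[:k]
    votes = Counter(n[field] for n in neighbours)
    return votes.most_common(1)[0][0]


def clean(records, k=3):
    """Return copies of the records with invalid fields replaced via k-NN."""
    valid = [r for r in records if not find_invalid(r)]
    cleaned = []
    for record in records:
        fixed = dict(record)
        bad = find_invalid(record)
        usable = [f for f in NUMERIC_FEATURES if f not in bad]
        for field in bad:
            if valid:  # without clean reference records, leave the value as-is
                fixed[field] = knn_fill(record, field, valid, usable, k)
        cleaned.append(fixed)
    return cleaned


# Made-up sample data, for illustration only.
records = [
    {"age": "30", "weight": "70", "gender": "M"},
    {"age": "31", "weight": "72", "gender": "M"},
    {"age": "29", "weight": "68", "gender": "M"},
    {"age": "60", "weight": "55", "gender": "F"},
    {"age": "62", "weight": "54", "gender": "F"},
    {"age": "32", "weight": "71", "gender": "??"},
]
print(clean(records)[-1])
```

In this sketch the last record fails the `gender` pattern, so its value is replaced by the majority label among its three closest valid records. A real system would also need a strategy for records whose numeric attributes are themselves invalid.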
Keywords :
attribute grammars; data handling; medical administrative data processing; E-Clean; data cleaning algorithm; data cleansing; data inconsistency; data scrubbing; dirty data identification; error detection; error removal; k-nearest neighbor algorithm; matching attributes; parsing techniques; patient data; raw data pre-processing; Classification algorithms; Cleaning; Data mining; Databases; Knowledge based systems; Load modeling; Transforms; data cleaning; k-Nearest Neighbor; regular expression;
Conference_Title :
Informatics and Computational Intelligence (ICI), 2011 First International Conference on
Conference_Location :
Bandung
Print_ISBN :
978-1-4673-0091-9
DOI :
10.1109/ICI.2011.21