Title : 
Clustering-based Missing Value Imputation for Data Preprocessing
         
        
            Author : 
Zhang, Chengqi ; Qin, Yongsong ; Zhu, Xiaofeng ; Zhang, Jilian ; Zhang, Shichao
         
        
            Author_Institution : 
Fac. of Inf. Technol., Univ. of Technol. Sydney, Broadway, NSW
         
        
        
        
        
        
            Abstract : 
Missing value imputation is a practical yet challenging problem in machine learning and data mining. It is a procedure that replaces the missing values in a dataset with plausible values, which are generally generated from the dataset using a deterministic or random method. In this paper we propose a new and efficient missing value imputation method based on data clustering, called CRI (clustering-based random imputation). In our approach, the missing values of an instance are filled in with plausible values generated from data similar to that instance using a kernel-based random method. Specifically, we first divide the dataset (excluding instances with missing values) into clusters. Each instance with missing values is then assigned to the cluster most similar to it. Finally, the missing values of an instance A are filled in with plausible values generated by applying a kernel-based method to the instances in A's cluster. Our experiments (some of which use the decision tree induction system C5.0) demonstrate the effectiveness of the proposed method for missing value imputation.
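
The three steps described in the abstract can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not specify the clustering algorithm, the similarity measure, or the kernel, so the code assumes plain k-means, Euclidean distance over the observed attributes, and a Gaussian kernel used as a sampling weight for picking a donor instance. The names `cri_impute`, `kmeans`, `bandwidth`, and `seed` are hypothetical, and missing values are assumed to be marked with `None`.

```python
import math
import random


def _dist(a, b, idx):
    """Euclidean distance between a and b over the attribute indices in idx."""
    return math.sqrt(sum((a[i] - b[i]) ** 2 for i in idx))


def kmeans(rows, k, rng, iters=20):
    """Plain Lloyd's k-means on complete rows (assumed clustering step)."""
    dims = range(len(rows[0]))
    centroids = [list(r) for r in rng.sample(rows, k)]

    def assign():
        return [min(range(k), key=lambda c: _dist(r, centroids[c], dims))
                for r in rows]

    for _ in range(iters):
        labels = assign()
        for c in range(k):
            members = [r for r, l in zip(rows, labels) if l == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign()


def cri_impute(data, k=2, bandwidth=1.0, seed=0):
    """Fill None entries using a cluster-restricted, kernel-weighted random draw."""
    rng = random.Random(seed)
    # Step 1: cluster only the instances without missing values.
    complete = [r for r in data if None not in r]
    centroids, labels = kmeans(complete, k, rng)

    result = []
    for row in data:
        if None not in row:
            result.append(list(row))
            continue
        observed = [i for i, v in enumerate(row) if v is not None]
        # Step 2: assign the incomplete instance to its most similar cluster,
        # comparing on observed attributes only.
        c = min(range(k), key=lambda j: _dist(row, centroids[j], observed))
        donors = [r for r, l in zip(complete, labels) if l == c] or complete
        # Step 3: Gaussian kernel weights (assumed kernel), then a random
        # draw of one donor from the cluster in proportion to those weights.
        weights = [math.exp(-_dist(row, d, observed) ** 2 / (2 * bandwidth ** 2))
                   for d in donors]
        if sum(weights) == 0:  # all weights underflowed: fall back to uniform
            weights = None
        donor = rng.choices(donors, weights=weights, k=1)[0]
        result.append([donor[i] if v is None else v for i, v in enumerate(row)])
    return result
```

Calling `cri_impute(rows, k=2)` on a list of equal-length numeric rows returns a new list in which each `None` has been replaced by the corresponding value of a donor drawn from the same cluster, while observed values are left unchanged. Drawing a donor at random, rather than taking a cluster mean, is one way to keep the imputation random rather than deterministic; the paper's actual kernel-based estimator may differ.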
         
        
            Keywords : 
data mining; learning (artificial intelligence); pattern clustering; random processes; clustering-based random imputation; data clustering; data mining; data preprocessing; kernel-based random method; machine learning; missing value imputation; Australia; Computer science; Data mining; Data preprocessing; Decision trees; Induction generators; Information technology; Machine learning; Nearest neighbor searches; Stochastic processes;
         
        
        
        
            Conference_Title : 
Industrial Informatics, 2006 IEEE International Conference on
         
        
            Conference_Location : 
Singapore
         
        
            Print_ISBN : 
0-7803-9700-2
         
        
            Electronic_ISBN : 
0-7803-9701-0
         
        
        
            DOI : 
10.1109/INDIN.2006.275767