Applying sensitivity analysis to missing data in classifiers

Author

Lei, Lei ; Wu, Naijun ; Liu, Peng

Author_Institution

Sch. of Inf. Manage. & Eng., Shanghai Univ. of Finance & Econ., China

Volume

2

fYear

2005

fDate

13-15 June 2005

Firstpage

1051

Abstract

Among data mining technologies, predictive classification has a wide range of applications. Practitioners build classification models to make predictions and aim for high classification accuracy. However, datasets often contain data quality problems that reduce the accuracy of classification models; missing data is a common example. In this paper, we investigate the influence of missing data on classifiers. First, basic concepts of data quality and sensitivity analysis are briefly introduced. Then, the sensitivity of six representative classifiers to missing data is studied through sensitivity experiments. The results indicate that when the proportion of missing data in a dataset exceeds 20%, it has a strongly adverse impact on the classification accuracy of the model. Moreover, missing data affect different datasets differently, depending on their characteristics. Among the six classifiers, the naive Bayesian classifier is the least sensitive to missing data.
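
The kind of sensitivity experiment the abstract describes can be illustrated with a minimal sketch. It is not the authors' procedure: the synthetic dataset, the completely-at-random deletion of attribute values, the injection of missing values into both the training and test sets, and all function names (make_dataset, inject_missing, sensitivity_curve) are assumptions made here for illustration. Only a simple naive Bayesian classifier is shown, and it handles a missing attribute by skipping it.

```python
import random
from collections import Counter, defaultdict

MISSING = None  # marker for a missing attribute value (assumption of this sketch)


def make_dataset(n, rng):
    """Synthetic data: three binary attributes, each equal to the label with probability 0.85."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        feats = [label if rng.random() < 0.85 else 1 - label for _ in range(3)]
        rows.append((feats, label))
    return rows


def inject_missing(rows, rate, rng):
    """Replace each attribute value with MISSING independently with probability `rate`."""
    return [([MISSING if rng.random() < rate else v for v in feats], label)
            for feats, label in rows]


def train_naive_bayes(rows):
    """Count class frequencies and per-class attribute values, ignoring missing values."""
    class_counts = Counter(label for _, label in rows)
    value_counts = defaultdict(Counter)  # (attribute index, label) -> value counts
    for feats, label in rows:
        for i, v in enumerate(feats):
            if v is not MISSING:
                value_counts[(i, label)][v] += 1
    return class_counts, value_counts


def predict(model, feats):
    """Return the label with the highest naive Bayes score; missing attributes are skipped."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / total
        for i, v in enumerate(feats):
            if v is MISSING:
                continue
            counts = value_counts[(i, label)]
            # Laplace smoothing over the two possible binary values
            score *= (counts[v] + 1) / (sum(counts.values()) + 2)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, rows):
    return sum(predict(model, feats) == label for feats, label in rows) / len(rows)


def sensitivity_curve(rates, seed=0):
    """Map each missing-data rate to the test accuracy obtained at that rate."""
    rng = random.Random(seed)
    train, test = make_dataset(2000, rng), make_dataset(1000, rng)
    curve = {}
    for rate in rates:
        model = train_naive_bayes(inject_missing(train, rate, rng))
        curve[rate] = accuracy(model, inject_missing(test, rate, rng))
    return curve


if __name__ == "__main__":
    for rate, acc in sensitivity_curve([0.0, 0.1, 0.2, 0.4, 0.6]).items():
        print(f"missing rate {rate:.0%}: accuracy {acc:.3f}")
```

Plotting accuracy against the missing rate for several classifiers and datasets would give the kind of comparison the abstract summarizes; the numbers from this toy data say nothing about the paper's reported results.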

Keywords

backpropagation; belief networks; data mining; decision trees; sensitivity analysis; classification accuracy; classification models; data quality problems; missing data; naive Bayesian classifier; predictive classification; Classification algorithms; Data engineering; Data mining; Data warehouses; Databases; Delta modulation; Economic forecasting; Finance; Information management; Sensitivity analysis

fLanguage

English

Publisher

IEEE

Conference_Titel

Services Systems and Services Management, 2005. Proceedings of ICSSSM '05. 2005 International Conference on

Print_ISBN

0-7803-8971-9

Type

conf

DOI

10.1109/ICSSSM.2005.1500155

Filename

1500155