Title :
An Empirical Study on Preprocessing High-Dimensional Class-Imbalanced Data for Classification
Author :
Hua Yin;Keke Gai
Author_Institution :
Inf. Sch., Guangdong Univ. of Finance &
Abstract :
Emerging data types bring tremendous challenges to data mining. High-dimensional class-imbalanced data are abundant in many fields. For such data, traditional classification methods are not appropriate because they tend to favor accuracy on the majority class. Meanwhile, the curse of dimensionality complicates the situation further. Designing a more complex classifier is not easy, and such a classifier may overfit the data. Preprocessing the data before classification is a more direct approach. Because high dimensionality and class imbalance interact, it is necessary to understand how preprocessing methods (feature selection and data sampling) affect the final classification. Previous experiments either gave little consideration to the datasets or introduced other characteristics that made the situation more complicated. We apply two types of feature selection (wrapper and filter) and two types of data sampling (oversampling and undersampling) to twelve selected datasets with different dimensionalities and imbalance levels from four fields, and test their effects on the performance of the C4.5 classifier. In our setting, the experiments indicate that (1) performing feature selection before sampling is better in most cases; (2) among the combinations of feature selection and data sampling, undersampling performs better than oversampling when the dataset is highly imbalanced; (3) when the dataset is less imbalanced, preprocessing may not be necessary; and (4) for wrapper-based feature selection, we suggest using a simple search method.
Keywords :
"Software","Cancer","Filtering algorithms","Data models","Training","Accuracy","Software algorithms"
Conference_Title :
High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), 2015 IEEE 17th International Conference on
DOI :
10.1109/HPCC-CSS-ICESS.2015.205