Title :
Preprocessing of imbalanced breast cancer data using feature selection combined with over-sampling technique for classification
Author :
Jojan, Janjira ; Srivihok, Anongnart
Author_Institution :
Dept. of Comput. Sci., Kasetsart Univ., Bangkok, Thailand
Abstract :
Class imbalance problems have been found in many medical datasets in recent years. A dataset is imbalanced when its class distribution is highly skewed, that is, when the number of instances in one class differs greatly from the number in the other classes. We propose feature selection combined with an over-sampling technique (FOT) to preprocess an imbalanced breast cancer dataset before classification. We first applied a feature selection technique, Consistency Subset Evaluation, to remove insignificant attributes. The remaining attributes were fed into an over-sampling phase to adjust the number of instances in the minority class. After preprocessing the dataset, we classified the data using three classification algorithms: decision tree, BayesNet, and OneR. The f-values of classification using FOT are 0.76, 0.638, and 0.64, respectively. These are greater than the f-values of the same three classifiers without FOT, which are 0.561, 0.518, and 0.512, respectively. The experimental results indicate that FOT achieves better f-values than non-FOT preprocessing and improves the performance of the classifiers on this dataset, especially the decision tree.
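The over-sampling phase and the f-value used for evaluation can be illustrated with a minimal sketch. The abstract does not say which over-sampling method the authors used, so random duplication of minority-class rows is assumed here purely for illustration; the function names `random_oversample` and `f_measure` and the toy data are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class.
    Assumption: the paper's actual over-sampling method is not stated."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Rows of this class that are available for duplication.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def f_measure(y_true, y_pred, positive):
    """Harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: four majority rows and one minority row.
toy_rows = [[1], [2], [3], [4], [5]]
toy_labels = ["benign", "benign", "benign", "benign", "malignant"]
balanced_rows, balanced_labels = random_oversample(toy_rows, toy_labels)
```

In this toy case the single minority row is duplicated until both classes have four rows. In the pipeline the abstract describes, this step would run after attribute selection and before training each classifier.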
Keywords :
Bayes methods; cancer; decision trees; medical computing; medical information systems; BayesNet; FOT; OneR; class imbalance problems; classification algorithms; consistency subset evaluation; decision tree; feature selection; imbalanced breast cancer data; medical data; over-sampling technique; Accuracy; Breast cancer; Classification algorithms; Decision trees; Machine learning algorithms; Measurement;
Conference_Title :
Advanced Computer Science and Information Systems (ICACSIS), 2013 International Conference on
Conference_Location :
Bali
DOI :
10.1109/ICACSIS.2013.6761610