Title :
The Effect of Data Sampling When Using Random Forest on Imbalanced Bioinformatics Data
Author :
David J. Dittman;Taghi M. Khoshgoftaar;Amri Napolitano
Author_Institution :
Florida Atlantic Univ., Boca Raton, FL, USA
Abstract :
Ensemble learning is a powerful tool that has shown promise when applied to bioinformatics datasets. In particular, the Random Forest classifier has been an effective and popular algorithm due to its relatively good classification performance and its ease of use. However, Random Forest does not account for class imbalance, which is known to decrease classification performance and increase bias towards the majority class. In this study, we seek to determine whether the inclusion of data sampling improves the performance of the Random Forest classifier. To test the effect of data sampling, we used Random Undersampling with two post-sampling class distribution ratios: 35:65 and 50:50 (minority:majority). We also built inductive models with Random Forest when no data sampling technique was applied, so that we could observe the true effect of the data sampling. All three options were tested on a series of fifteen imbalanced bioinformatics datasets. Our results show that data sampling does improve the classification performance of Random Forest, especially when using the 50:50 post-sampling class distribution ratio. However, statistical analysis shows that the increase in performance is not statistically significant. Thus, we can state that while data sampling does improve the classification performance of Random Forest, it is not a necessary step, as the classifier is fairly robust to imbalanced data on its own.
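The abstract does not say how Random Undersampling was implemented. As a minimal sketch only, assuming the usual definition (keep every minority instance and randomly discard majority instances until the target minority:majority ratio is reached), the two post-sampling ratios could be produced as follows. The function name `random_undersample`, its parameters, and the toy label list are hypothetical and are not taken from the paper.

```python
import random


def random_undersample(labels, minority_label, minority_fraction, seed=0):
    """Return sorted indices to keep so that the minority class makes up
    roughly `minority_fraction` of the kept instances.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    if not 0 < minority_fraction < 1:
        raise ValueError("minority_fraction must be strictly between 0 and 1")

    minority_idx = [i for i, y in enumerate(labels) if y == minority_label]
    majority_idx = [i for i, y in enumerate(labels) if y != minority_label]

    # Number of majority instances that yields the requested ratio,
    # e.g. 0.35 -> 35:65 and 0.50 -> 50:50 (minority:majority).
    target_majority = round(
        len(minority_idx) * (1 - minority_fraction) / minority_fraction
    )
    # Undersampling can only remove instances, never add them.
    target_majority = min(target_majority, len(majority_idx))

    rng = random.Random(seed)
    kept_majority = rng.sample(majority_idx, target_majority)
    return sorted(minority_idx + kept_majority)


# Toy imbalanced label vector (assumed data): 7 minority, 93 majority.
labels = [1] * 7 + [0] * 93

balanced = random_undersample(labels, minority_label=1, minority_fraction=0.50)
skewed = random_undersample(labels, minority_label=1, minority_fraction=0.35)
```

The returned indices would then select the rows used to train the classifier; the no-sampling baseline described in the abstract corresponds to skipping this step.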
Keywords :
"Bioinformatics","Vegetation","Data models","Training","Biological system modeling","Robustness","Measurement"
Conference_Title :
Information Reuse and Integration (IRI), 2015 IEEE International Conference on
DOI :
10.1109/IRI.2015.76