The Effect of Data Sampling When Using Random Forest on Imbalanced Bioinformatics Data

Author

David J. Dittman; Taghi M. Khoshgoftaar; Amri Napolitano

Author_Institution

Florida Atlantic Univ., Boca Raton, FL, USA

Year

2015

Firstpage

457

Lastpage

463

Abstract

Ensemble learning is a powerful tool that has shown promise when applied to bioinformatics datasets. In particular, the Random Forest classifier has been an effective and popular algorithm due to its relatively good classification performance and its ease of use. However, Random Forest does not account for class imbalance, which is known to decrease classification performance and increase bias towards the majority class. In this study, we seek to determine whether the inclusion of data sampling improves the performance of the Random Forest classifier. To test the effect of data sampling, we used Random Undersampling with two post-sampling class distribution ratios: 35:65 and 50:50 (minority:majority). We also built inductive models with Random Forest when no data sampling technique was applied, so that we could observe the true effect of the data sampling. All three options were tested on a series of fifteen imbalanced bioinformatics datasets. Our results show that data sampling does improve the classification performance of Random Forest, especially when using the 50:50 post-sampling class distribution ratio. However, statistical analysis shows that the increase in performance is not statistically significant. Thus, we can state that while data sampling does improve the classification performance of Random Forest, it is not a necessary step, as the classifier is fairly robust to imbalanced data on its own.
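
The Random Undersampling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `random_undersample`, the `minority_share` parameter, and the fixed seed are hypothetical, and the sketch assumes a binary class label in which every minority instance is kept and only majority instances are discarded. A `minority_share` of 0.35 corresponds to the 35:65 ratio and 0.50 to the 50:50 ratio.

```python
import random
from collections import Counter


def random_undersample(labels, minority_share, seed=0):
    """Return sorted indices of the instances kept after undersampling.

    Hypothetical helper: keeps every minority instance and a random subset
    of majority instances so that the minority class makes up roughly
    `minority_share` of the result (0.35 for 35:65, 0.50 for 50:50).
    """
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    minority = min(counts, key=counts.get)

    min_idx = [i for i, y in enumerate(labels) if y == minority]
    maj_idx = [i for i, y in enumerate(labels) if y != minority]

    # Majority count m that satisfies n_min / (n_min + m) == minority_share.
    target_maj = round(len(min_idx) * (1 - minority_share) / minority_share)
    # Undersampling only removes instances, so never ask for more than exist.
    target_maj = min(target_maj, len(maj_idx))

    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    kept_maj = rng.sample(maj_idx, target_maj)
    return sorted(min_idx + kept_maj)


# Toy data (not from the paper): 35 minority and 165 majority instances.
labels = ["pos"] * 35 + ["neg"] * 165
idx_35_65 = random_undersample(labels, 0.35)
idx_50_50 = random_undersample(labels, 0.50)
print(len(idx_35_65), len(idx_50_50))
```

The returned indices would then select the rows used to train the classifier, while the no-sampling baseline trains on all rows. Whether sampling is applied only to the training portion of each split is not stated in the abstract, so that detail is left open here.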

Keywords

"Bioinformatics","Vegetation","Data models","Training","Biological system modeling","Robustness","Measurement"

Publisher

IEEE

Conference_Title

Information Reuse and Integration (IRI), 2015 IEEE International Conference on

Type

conf

DOI

10.1109/IRI.2015.76

Filename

7301012