Title :
Effects of the Use of Boosting on Classification Performance of Imbalanced Bioinformatics Datasets
Author :
Khoshgoftaar, Taghi M. ; Fazelpour, Alireza ; Dittman, David J. ; Napolitano, Amri
Author_Institution :
Florida Atlantic Univ., Boca Raton, FL, USA
Abstract :
In the domain of bioinformatics, two common problems encountered when analyzing real-world datasets are class imbalance and high dimensionality. Boosting is a technique that can be used to improve classification performance, even in the presence of class imbalance. In addition, data sampling and feature selection are two important preprocessing techniques used to counter the adverse effects of both challenges collectively. In this study, we examine whether the inclusion of boosting along with the joint deployment of feature selection and data sampling techniques affects the classification performance of inductive models. To this end, we used two approaches: filter-based feature selection followed by either data sampling (denoted as FS-DS) or a hybrid data sampling and boosting technique called RUSBoost (denoted as FRB), which integrates random undersampling within the boosting process. We conducted an extensive experimental study using six high-dimensional and imbalanced bioinformatics datasets along with three learners and four feature subset sizes. Our results show that the improvement in classification performance due to boosting depends on the choice of learner used to build the model. We recommend FRB because it outperforms FS-DS in nearly all scenarios. Additionally, our ANOVA analysis shows that FRB is statistically distinguishable from FS-DS when using the LR learner. To our knowledge, this is the first study to investigate the effects of boosting along with combined feature selection and data sampling on the classification performance of inductive models in the domain of bioinformatics.
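The random undersampling step that the abstract describes as integrated into the boosting process can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `random_undersample`, the `minority_fraction` parameter, and the default 50:50 post-sampling class ratio are assumptions made for illustration, and the abstract does not state which ratio the study used. In a RUSBoost-style procedure, a step like this would be applied to the training data at each boosting round before the weak learner is fit.

```python
import random
from collections import Counter


def random_undersample(X, y, minority_fraction=0.5, seed=0):
    """Randomly drop majority-class rows until the minority class makes up
    roughly `minority_fraction` of the returned sample (binary labels only).

    Hypothetical helper for illustration; the 0.5 default is an assumption.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    if len(counts) != 2:
        raise ValueError("this sketch handles binary labels only")
    (majority, _), (minority, n_min) = counts.most_common(2)

    maj_idx = [i for i, label in enumerate(y) if label == majority]
    min_idx = [i for i, label in enumerate(y) if label == minority]

    # Number of majority rows to keep for the requested class ratio,
    # capped at the number of majority rows actually available.
    n_keep = round(n_min * (1 - minority_fraction) / minority_fraction)
    n_keep = min(n_keep, len(maj_idx))

    # Keep every minority row plus a random subset of majority rows,
    # preserving the original row order.
    kept = sorted(min_idx + rng.sample(maj_idx, n_keep))
    return [X[i] for i in kept], [y[i] for i in kept]


# Toy data: 4 minority instances (label 1) and 16 majority instances (label 0).
X = [[i] for i in range(20)]
y = [1] * 4 + [0] * 16
X_bal, y_bal = random_undersample(X, y)
print(Counter(y_bal))
```

Undersampling discards majority-class instances rather than synthesizing minority ones, which keeps each round's training set small; repeating the sampling across boosting rounds lets different majority instances contribute to different weak learners.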
Keywords :
bioinformatics; classification; feature selection; filters; inference mechanisms; learning (artificial intelligence); random processes; sampling methods; statistical analysis; ANOVA analysis; FRB; FS-DS; LR learner; RUSBoost; boosting effect; class imbalance; classification performance; data sampling technique effect; feature selection effect; feature subset size; filter-based feature selection; high dimensional bioinformatics dataset; hybrid data sampling-boosting technique; imbalanced bioinformatics dataset; inductive model; learner selection dependence; preprocessing techniques; random under sampling integration; real-world dataset analysis; Analysis of variance; Bioinformatics; Biological system modeling; Boosting; Buildings; Data models; Joints; boosting; data sampling; high dimensionality
Conference_Title :
Bioinformatics and Bioengineering (BIBE), 2014 IEEE International Conference on
Conference_Location :
Boca Raton, FL
DOI :
10.1109/BIBE.2014.68