Title :
Random Forest with 200 Selected Features: An Optimal Model for Bioinformatics Research
Author :
Wald, Randall ; Khoshgoftaar, Taghi ; Dittman, David J. ; Napolitano, Antonio
Author_Institution :
Florida Atlantic Univ., Boca Raton, FL, USA
Abstract :
Many problems in bioinformatics involve high-dimensional, difficult-to-process collections of data. For example, gene microarrays can record the expression levels of thousands of genes, many of which have no relevance to the underlying medical or biological question. Building classification models on such datasets can thus take excessive computational time and still give poor results. Many strategies exist to combat these problems, including feature selection (which chooses only the most relevant genes for building models) and ensemble learners (which combine multiple weak classification learners into one collection that should give a broader view of the data). However, these techniques present a new challenge: choosing which combination of strategies is most appropriate for a given collection of data. This is especially difficult for health informatics and bioinformatics practitioners who do not have an extensive machine learning background. An ideal model should be easy to use and apply, helping the practitioner either by making these choices in advance or by being insensitive to them. In this work we demonstrate that the Random Forest learner, when using 100 trees and 200 features (selected by any reasonable feature ranking technique, as the specific choice does not matter), is such a model. To show this, we use 25 bioinformatics datasets drawn from a number of different cancer diagnosis and identification problems, and we compare Random Forest with 5 other learners. We also test 25 feature ranking techniques and 12 feature subset sizes to optimize the feature selection step. Our results show that Random Forest with 100 trees and 200 selected features is statistically significantly better than the alternatives (or, in the case of using 200 features, statistically equivalent to the top choices), and that the specific choice of ranking technique is statistically insignificant.
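The following is a minimal sketch (not the authors' code) of the recommended configuration: rank features with a univariate scoring technique, keep the top 200, and train a Random Forest with 100 trees. It assumes scikit-learn, and the synthetic dataset and the choice of ANOVA F-score as the ranker are illustrative stand-ins (the paper reports the specific ranking technique does not matter).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for a gene-expression dataset (samples x genes).
X, y = make_classification(n_samples=100, n_features=5000, n_informative=50,
                           random_state=0)

# Rank genes with a univariate score (ANOVA F-score here), keep the top 200,
# then fit a Random Forest with 100 trees on the selected features.
model = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=200)),
    ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
])

# Putting selection inside the pipeline keeps the ranking step inside each
# cross-validation fold, avoiding selection bias from the held-out samples.
scores = cross_val_score(model, X, y, cv=5)
print("Mean CV accuracy: %.3f" % scores.mean())
```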
Keywords :
bioinformatics; cancer; feature selection; learning (artificial intelligence); patient diagnosis; pattern classification; statistical analysis; cancer diagnosis; cancer identification; classification models; ensemble learners; feature ranking techniques; health informatics; weak classification learners; Random Forest; biological system modeling; data models; support vector machines;
Conference_Titel :
2013 12th International Conference on Machine Learning and Applications (ICMLA)
Conference_Location :
Miami, FL, USA
DOI :
10.1109/ICMLA.2013.34