DocumentCode
260301
Title
Using Correlation-Based Feature Selection for a Diverse Collection of Bioinformatics Datasets
Author
Wald, Randall ; Khoshgoftaar, Taghi M. ; Napolitano, Amri
Author_Institution
Florida Atlantic Univ., Boca Raton, FL, USA
fYear
2014
fDate
10-12 Nov. 2014
Firstpage
156
Lastpage
162
Abstract
The large number of genes found in most gene microarray datasets demands the use of feature selection techniques to alleviate the problem of high dimensionality. However, the computational cost of filter-based subset evaluation techniques such as Correlation-Based Feature Selection (CFS) has generally limited their use to smaller datasets, or at least to smaller collections of gene microarray datasets. No previous work has applied CFS to a large and diverse range of bioinformatics datasets. To address this gap, we employ nine microarray datasets exhibiting a wide range of characteristics in terms of dataset balance (the fraction of instances found in the minority class) and dataset difficulty of learning (the overall difficulty of building effective classification models on the raw, pre-feature-selection datasets). We also use five classification learners to discover how they perform in conjunction with CFS, along with five performance metrics to give a broad perspective on our results. The results show that CFS can be used to help build effective models, in particular when used with the 5-Nearest Neighbors learner on data that is Easy or Moderate (in terms of difficulty of learning) or Balanced (in terms of class distribution). For other types of data, the optimal learner varies, although in most cases the Logistic Regression learner performs worst in conjunction with CFS.
Keywords
bioinformatics; biological techniques; cellular biophysics; correlation methods; feature selection; genetics; 5-Nearest Neighbors; bioinformatics datasets; correlation-based feature selection; data set dimensionality; dataset balance; dataset difficulty of learning; diverse collection; feature selection techniques; filter-based subset evaluation techniques; five classification learners; gene microarray datasets; logistic regression learner; optimal learner variations; performance metrics; pre-feature-selection datasets; Bioinformatics; Buildings; Cancer; Correlation; Measurement; Niobium; Support vector machines; Balance; Correlation-Based Feature Selection; Difficulty of Learning
fLanguage
English
Publisher
IEEE
Conference_Titel
Bioinformatics and Bioengineering (BIBE), 2014 IEEE International Conference on
Conference_Location
Boca Raton, FL
Type
conf
DOI
10.1109/BIBE.2014.63
Filename
7033574