• DocumentCode
    260301
  • Title
    Using Correlation-Based Feature Selection for a Diverse Collection of Bioinformatics Datasets
  • Author
    Wald, Randall ; Khoshgoftaar, Taghi M. ; Napolitano, Amri
  • Author_Institution
    Florida Atlantic Univ., Boca Raton, FL, USA
  • fYear
    2014
  • fDate
    10-12 Nov. 2014
  • Firstpage
    156
  • Lastpage
    162
  • Abstract
    The large number of genes found in most gene microarray datasets demands the use of feature selection techniques to alleviate the problem of high dimensionality. However, the computational cost of filter-based subset evaluation techniques such as Correlation-Based Feature Selection (CFS) has generally limited their use to smaller datasets, or at least to smaller collections of gene microarray datasets. No previous work has applied CFS to a large and diverse range of bioinformatics datasets. To address this gap, we employ nine microarray datasets exhibiting a wide range of characteristics in terms of dataset balance (the fraction of instances found in the minority class) and difficulty of learning (the overall difficulty of building effective classification models on the raw, pre-feature-selection datasets). We also use five classification learners to discover how they perform in conjunction with CFS, along with five performance metrics to give a broad perspective on our results. The results show that CFS can be used to help build effective models, in particular when used with the 5-Nearest Neighbors learner on data that is Easy or Moderate (in terms of difficulty of learning) or Balanced (in terms of class distribution). For other types of data, the optimal learner varies, although in most cases the Logistic Regression learner performs worst in conjunction with CFS.
  • Keywords
    bioinformatics; biological techniques; cellular biophysics; correlation methods; feature selection; genetics; 5-Nearest Neighbors; bioinformatics datasets; correlation-based feature selection; data set dimensionality; dataset balance; dataset difficulty of learning; diverse collection; feature selection techniques; filter-based subset evaluation techniques; five classification learners; gene microarray datasets; logistic regression learner; optimal learner variations; performance metrics; pre-feature-selection datasets; Bioinformatics; Buildings; Cancer; Correlation; Measurement; Niobium; Support vector machines; Balance; Bioinformatics; Correlation-Based Feature Selection; Difficulty of Learning;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Bioinformatics and Bioengineering (BIBE), 2014 IEEE International Conference on
  • Conference_Location
    Boca Raton, FL
  • Type
    conf
  • DOI
    10.1109/BIBE.2014.63
  • Filename
    7033574
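
The abstract names two quantities that are easy to illustrate: dataset balance (the fraction of instances in the minority class) and the subset merit that CFS optimizes. The sketch below is only an illustration, not the authors' implementation. It uses the commonly cited CFS merit, k * r_cf / sqrt(k + k(k-1) * r_ff), and makes two simplifying assumptions that the record does not confirm: absolute Pearson correlation stands in for the correlation measure, and a plain greedy forward search stands in for the search strategy. All function names and the toy data are hypothetical.

```python
import math

def minority_fraction(labels):
    """Dataset balance: fraction of instances in the minority class."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return min(counts.values()) / len(labels)

def abs_pearson(a, b):
    """Absolute Pearson correlation; 0.0 if either input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return abs(cov / math.sqrt(var_a * var_b))

def cfs_merit(columns, labels, subset):
    """CFS merit of a feature subset: k*r_cf / sqrt(k + k*(k-1)*r_ff)."""
    k = len(subset)
    if k == 0:
        return 0.0
    # Mean feature-class correlation.
    r_cf = sum(abs_pearson(columns[j], labels) for j in subset) / k
    # Mean feature-feature correlation over all distinct pairs.
    pairs = [(i, j) for pos, i in enumerate(subset) for j in subset[pos + 1:]]
    r_ff = (sum(abs_pearson(columns[i], columns[j]) for i, j in pairs) / len(pairs)
            if pairs else 0.0)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

def greedy_cfs(columns, labels, tol=1e-12):
    """Greedy forward search (an assumption): add features while merit improves."""
    selected, best = [], 0.0
    remaining = list(range(len(columns)))
    while remaining:
        cand = max(remaining, key=lambda j: cfs_merit(columns, labels, selected + [j]))
        score = cfs_merit(columns, labels, selected + [cand])
        if score <= best + tol:
            break
        selected.append(cand)
        remaining.remove(cand)
        best = score
    return selected, best

# Toy data (hypothetical): feature 0 tracks the class, feature 1 duplicates it,
# feature 2 is an alternating pattern weakly related to the class.
labels = [0, 0, 0, 1, 1, 1, 1, 1]
columns = [
    [0, 0, 0, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1, 1, 1],
    [1, 0, 1, 0, 1, 0, 1, 0],
]
balance = minority_fraction(labels)
selected, merit = greedy_cfs(columns, labels)
print(balance, selected, merit)
```

On this toy data the redundant copy of feature 0 does not raise the merit, which is the behavior the merit formula is designed to produce: features are rewarded for correlating with the class and penalized for correlating with each other.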