• DocumentCode
    589124
  • Title
    Evaluation of Feature Ranking Ensembles for High-Dimensional Biomedical Data: A Case Study
  • Author
    Kuncheva, Ludmila I. ; Smith, C.J. ; Syed, Y. ; Phillips, C.O. ; Lewis, K.E.
  • Author_Institution
    Sch. of Comput. Sci., Bangor Univ., Bangor, UK
  • fYear
    2012
  • fDate
    10-10 Dec. 2012
  • Firstpage
    49
  • Lastpage
    56
  • Abstract
    Developing accurate, reliable and easy-to-use diagnostic tests depends on identifying a small set of highly discriminative biomarkers. This task can be cast as feature selection within a pattern recognition context. Medical data are usually of the "wide" type, where the number of features is substantially larger than the number of instances. With the abundance of feature ranking methods, it is difficult to pick the most suitable one and choose a final consistent feature subset. Ensembles of ranking methods have been recommended for the task, but their stability and accuracy have not been evaluated across different ranking methods. Here we present a case study consisting of 429 samples of exhaled air from smokers, 83% of whom suffer from COPD (chronic obstructive pulmonary disease). The task is to identify a discriminative subset of the 1929 volatile organic compounds (VOCs) measured through mass spectrometry. Using Pareto analysis, 16 feature ranking ensembles were evaluated with respect to three criteria: classification accuracy, area under the ROC curve and the stability of the ensemble selection. The t-statistic was rated the best among the 16 feature rankers, outperforming the currently favoured SVM ranker.
  • Keywords
    Pareto analysis; data handling; feature extraction; medical diagnostic computing; pattern classification; COPD; VOC; area-under-the ROC curve; chronic obstructive pulmonary disease; classification accuracy; diagnostic tests; discriminative biomarkers; ensemble selection stability; feature ranking ensemble evaluation; feature ranking methods; feature selection; high-dimensional biomedical data; mass spectrometry; pattern recognition context; t-statistic; volatile organic compounds; Accuracy; Educational institutions; Indexes; Stability criteria; Support vector machines; Vegetation; classifier ensembles; feature ranking; stability index
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining Workshops (ICDMW), 2012 IEEE 12th International Conference on
  • Conference_Location
    Brussels
  • Print_ISBN
    978-1-4673-5164-5
  • Type
    conf
  • DOI
    10.1109/ICDMW.2012.12
  • Filename
    6406422