DocumentCode :
260297
Title :
Evaluation of Wrapper-Based Feature Selection Using Hard, Moderate, and Easy Bioinformatics Data
Author :
Abu Shanab, Ahmad ; Khoshgoftaar, Taghi M. ; Wald, Randall
Author_Institution :
Florida Atlantic Univ., Boca Raton, FL, USA
fYear :
2014
fDate :
10-12 Nov. 2014
Firstpage :
149
Lastpage :
155
Abstract :
One of the most challenging problems encountered when analyzing real-world gene expression datasets is high dimensionality (an overabundance of features/attributes). This large number of features can lead to suboptimal classification performance and increased computation time. Feature selection, whereby only a subset of the original features is used for building a classification model, is the most commonly used technique to counter high dimensionality. One category of feature selection, wrapper-based techniques, employs a classifier to directly find the subset of features which performs best. Unfortunately, noise can negatively impact the effectiveness of data mining techniques and subsequently lead to suboptimal results. Class noise in particular has a detrimental effect on classification performance, making datasets perform poorly across a wide range of classifiers (i.e., have a high "difficulty-of-learning"). No previous work has examined the effectiveness of wrapper-based feature selection when learning from real-world high-dimensional gene expression datasets in the context of difficulty-of-learning due to noise. To study this effectiveness, we perform experiments using ten gene expression datasets which were first determined to be easy to learn from and then had artificial class noise injected in a controlled fashion, creating three levels of difficulty-of-learning (Easy, Moderate, and Hard). Using the Naïve Bayes learner, we perform wrapper feature selection followed by classification with four classifiers (Naïve Bayes, Multilayer Perceptron, 5-Nearest Neighbor, and Support Vector Machines), and we compare these results to the classification performance without feature selection. The results show that the effectiveness of wrapper-based feature selection depends on the choice of learner: for Multilayer Perceptron, wrapper selection improved performance compared to not using feature selection, while for Naïve Bayes it slightly reduced performance and for the remaining learners it reduced performance further. Because its performance relative to no feature selection varied depending on the choice of learner, we recommend that wrapper selection be at least considered in future bioinformatics experiments, especially if the goal is gene discovery rather than classification. Also, as dimensionality reduction techniques are not only useful but necessary for high-dimensional bioinformatics datasets, the no-feature-selection case may not be feasible in practice.
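The sketch below is a minimal illustration of the workflow described in the abstract (controlled class-noise injection, wrapper-based feature selection with a Naïve Bayes learner, then evaluation with the four named classifiers). It is not the authors' code: it assumes scikit-learn, uses a synthetic stand-in dataset, and all parameter values (feature counts, noise rate, number of selected features) are arbitrary placeholders.

# Minimal sketch of a wrapper-based feature-selection workflow in scikit-learn.
# The dataset, noise rate, and parameter settings are illustrative assumptions,
# not values taken from the paper.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Stand-in for a gene expression dataset: many features, few samples.
X, y = make_classification(n_samples=100, n_features=100,
                           n_informative=15, random_state=0)

# Simulate class noise by flipping a fraction of labels in a controlled
# fashion (10% here is an arbitrary illustration of the injection step).
rng = np.random.default_rng(0)
flip = rng.choice(len(y), size=int(0.10 * len(y)), replace=False)
y_noisy = y.copy()
y_noisy[flip] = 1 - y_noisy[flip]

# Wrapper-based feature selection: Naive Bayes is the internal learner that
# scores candidate feature subsets (forward selection in this sketch).
selector = SequentialFeatureSelector(GaussianNB(), n_features_to_select=10,
                                     direction="forward", cv=5)
X_selected = selector.fit_transform(X, y_noisy)

# Evaluate the selected subset with the four classifiers named in the abstract
# and compare against the no-feature-selection baseline (all features).
classifiers = {
    "Naive Bayes": GaussianNB(),
    "Multilayer Perceptron": MLPClassifier(max_iter=1000, random_state=0),
    "5-Nearest Neighbor": KNeighborsClassifier(n_neighbors=5),
    "Support Vector Machine": SVC(),
}
for name, clf in classifiers.items():
    with_fs = cross_val_score(clf, X_selected, y_noisy, cv=5).mean()
    without_fs = cross_val_score(clf, X, y_noisy, cv=5).mean()
    print(f"{name}: with selection={with_fs:.3f}, without={without_fs:.3f}")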
Keywords :
Bayes methods; bioinformatics; data mining; feature selection; genetics; multilayer perceptrons; pattern classification; support vector machines; 5-nearest neighbor classifiers; Naive Bayes learner classifiers; artificial class noise; bioinformatics data; classification model; data mining; detrimental effect; difficulty-of-learning levels; gene expression datasets; multilayer perceptron classifiers; suboptimal classification; support vector machines; wrapper-based feature selection; Bioinformatics; Buildings; Gene expression; Measurement; Noise; Support vector machines; Training; bioinformatics; difficulty of learning; noise injection; wrapper-based feature selection;
fLanguage :
English
Publisher :
ieee
Conference_Titel :
Bioinformatics and Bioengineering (BIBE), 2014 IEEE International Conference on
Conference_Location :
Boca Raton, FL
Type :
conf
DOI :
10.1109/BIBE.2014.62
Filename :
7033573