Author :
Wald, Randall ; Khoshgoftaar, Taghi M. ; Shanab, Ahmad Abu
Author_Institution :
Florida Atlantic Univ., Boca Raton, FL, USA
Abstract :
A common challenge in feature (gene) selection is the instability of the selected genes. The stability of a selection technique is defined as the degree of agreement between its outputs on differently perturbed versions of the same input data. Very little work considers the impact of noise and sampling (e.g., preprocessing techniques used to cope with class imbalance, such as undersampling and oversampling) on the stability of gene selection techniques. In this study we compare two frameworks for evaluating this stability: "sampled-noisy vs. clean" and "sampled-noisy vs. sampled-noisy." Both frameworks involve noise injection followed by sampling; they differ in that the first compares the features selected from the perturbed datasets (perturbed by noise injection followed by sampling) with the features selected from the original (clean) dataset, while the second performs pairwise comparisons among the results from the perturbed datasets. Intuitively, the first framework should give more consistent results, since it randomizes only one half of each comparison rather than both halves. The primary goal of this paper is to discover whether, despite this difference, the two frameworks show similar patterns and lead to similar conclusions. This is tested using four groups of cancer gene datasets. We employ ten feature rankers from three different families, apply three sampling techniques, and generate artificial class noise to better simulate real-world datasets. The results show that Mutual Information, Signal-To-Noise, and Deviance show the best stability across the two frameworks, while Gain Ratio shows the worst stability on average. The results also show that the two frameworks have the same stability pattern, i.e., the rankers that perform well (or poorly) in the first framework also perform well (or poorly) in the second. This means that the second framework, which is less computationally intensive (because it does not perform feature selection on the clean data), can be used without requiring the first framework.
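The two frameworks described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several parts are assumptions made only for illustration: a simple signal-to-noise ranker, uniformly random label flipping as the class noise, random undersampling as the sampling technique, Jaccard similarity as the agreement measure, and all function names (`snr_scores`, `perturbed_selections`, and so on). The abstract does not specify any of these details.

```python
import random
import statistics
from itertools import combinations


def make_toy_data(n_neg, n_pos, n_features, n_informative, rng):
    """Synthetic two-class data (assumption): the first n_informative
    features are shifted by 2.0 for the positive class."""
    X, y = [], []
    for label, count in ((0, n_neg), (1, n_pos)):
        for _ in range(count):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            if label == 1:
                for j in range(n_informative):
                    row[j] += 2.0
            X.append(row)
            y.append(label)
    return X, y


def snr_scores(X, y):
    """Per-feature signal-to-noise score: |mean1 - mean0| / (sd1 + sd0)."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        spread = statistics.pstdev(pos) + statistics.pstdev(neg)
        gap = abs(statistics.mean(pos) - statistics.mean(neg))
        scores.append(gap / (spread + 1e-12))  # small constant avoids 0/0
    return scores


def top_k(scores, k):
    """Indices of the k highest-scoring features, as a set."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return set(order[:k])


def inject_class_noise(y, rate, rng):
    """Flip the labels of a random fraction `rate` of instances (assumption)."""
    noisy = list(y)
    for i in rng.sample(range(len(y)), round(rate * len(y))):
        noisy[i] = 1 - noisy[i]
    return noisy


def undersample(X, y, rng):
    """Random undersampling: shrink the larger class to the smaller one's size."""
    idx0 = [i for i, label in enumerate(y) if label == 0]
    idx1 = [i for i, label in enumerate(y) if label == 1]
    small, large = sorted((idx0, idx1), key=len)
    keep = small + rng.sample(large, len(small))
    return [X[i] for i in keep], [y[i] for i in keep]


def jaccard(a, b):
    """Agreement between two feature subsets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def perturbed_selections(X, y, k, runs, noise_rate, seed):
    """Noise injection followed by sampling, then feature selection, per run."""
    rng = random.Random(seed)
    selections = []
    for _ in range(runs):
        noisy_y = inject_class_noise(y, noise_rate, rng)
        Xs, ys = undersample(X, noisy_y, rng)
        selections.append(top_k(snr_scores(Xs, ys), k))
    return selections


def sampled_noisy_vs_clean(clean, perturbed):
    """Framework 1: compare each perturbed selection with the clean one."""
    return statistics.mean(jaccard(clean, s) for s in perturbed)


def sampled_noisy_vs_sampled_noisy(perturbed):
    """Framework 2: pairwise comparison among the perturbed selections."""
    return statistics.mean(jaccard(a, b) for a, b in combinations(perturbed, 2))


if __name__ == "__main__":
    X, y = make_toy_data(40, 20, 30, 5, random.Random(0))
    perturbed = perturbed_selections(X, y, k=5, runs=6, noise_rate=0.1, seed=1)
    clean = top_k(snr_scores(X, y), 5)  # only framework 1 needs this step
    print("sampled-noisy vs. clean:        ", sampled_noisy_vs_clean(clean, perturbed))
    print("sampled-noisy vs. sampled-noisy:", sampled_noisy_vs_sampled_noisy(perturbed))
```

In this sketch the only step unique to the first framework is the feature selection on the clean data, which is the step the abstract identifies as the extra cost that the second framework avoids.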
Keywords :
bioinformatics; feature selection; learning (artificial intelligence); artificial class noise; cancer gene datasets; gain ratio; gene-selection techniques; machine learning; mutual information; noise injection; noisy class-imbalanced data; sampled-noisy vs clean framework; sampled-noisy vs sampled-noisy framework; signal-to-noise; stability pattern; Cancer; Gene expression; Lungs; Noise; Noise measurement; Stability criteria