Comparison of Two Frameworks for Measuring the Stability of Gene-Selection Techniques on Noisy Class-Imbalanced Data

Author

Wald, Randall ; Khoshgoftaar, Taghi M. ; Shanab, Ahmad Abu

Author_Institution

Florida Atlantic Univ., Boca Raton, FL, USA

fYear

2013

fDate

4-6 Nov. 2013

Firstpage

881

Lastpage

888

Abstract

A common challenge encountered with feature (gene) selection is the instability of the selected genes, where stability is defined as the degree of agreement between a technique's outputs on differently perturbed versions of the same input data. Very little work considers the impact of noise and sampling (e.g., preprocessing techniques used to cope with class imbalance, such as undersampling and oversampling) on the stability of gene-selection techniques. In this study we compare two frameworks for evaluating this stability: "sampled-noisy vs. clean" and "sampled-noisy vs. sampled-noisy." Both frameworks involve noise injection followed by sampling; they differ in that the first compares the features selected from the perturbed datasets (perturbed by noise injection followed by sampling) with the features selected from the original (clean) dataset, while the second performs pairwise comparisons among the results from the perturbed datasets. Intuitively, the first framework should give more consistent results, since it randomizes only one half of the comparison rather than both. The primary goal of this paper is to discover whether, despite this difference, the two frameworks show similar patterns and lead to similar conclusions. This is tested using four groups of cancer gene datasets. We employ ten feature rankers from three different families, apply three sampling techniques, and generate artificial class noise to better simulate real-world datasets. The results show that Mutual Information, Signal-To-Noise, and Deviance show the best stability across the two frameworks, while Gain Ratio shows the worst stability on average. The results also show that the two frameworks have the same stability pattern, i.e., the rankers that perform well (or poorly) in the first framework perform as well (or as poorly) in the second. This means that the second framework, which is less computationally intensive (because it does not perform feature selection on the clean data), can be used without requiring the first framework.
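
The difference between the two comparison schemes described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the similarity measure, so Jaccard overlap is used here purely as an assumption, and the function names (`jaccard`, `stability_vs_clean`, `stability_pairwise`) and the gene identifiers are hypothetical.

```python
from itertools import combinations
from statistics import mean


def jaccard(a, b):
    """Overlap of two feature subsets: |a & b| / |a | b| (assumed measure)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty selections are treated as identical
    return len(a & b) / len(a | b)


def stability_vs_clean(clean_subset, perturbed_subsets):
    """'Sampled-noisy vs. clean': each perturbed result against the clean one."""
    return mean(jaccard(clean_subset, p) for p in perturbed_subsets)


def stability_pairwise(perturbed_subsets):
    """'Sampled-noisy vs. sampled-noisy': every pair of perturbed results."""
    return mean(jaccard(a, b) for a, b in combinations(perturbed_subsets, 2))


# Hypothetical top-4 gene subsets, for illustration only.
clean = {"g1", "g2", "g3", "g4"}
perturbed = [
    {"g1", "g2", "g3", "g5"},
    {"g1", "g2", "g4", "g6"},
    {"g1", "g3", "g4", "g5"},
]

print("vs. clean:", stability_vs_clean(clean, perturbed))
print("pairwise: ", stability_pairwise(perturbed))
```

In this sketch only `stability_vs_clean` needs a feature subset selected from the clean data; `stability_pairwise` works from the perturbed results alone.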

Keywords

bioinformatics; feature selection; learning (artificial intelligence); artificial class noise; cancer gene datasets; gain ratio; gene-selection techniques; machine learning; mutual information; noise injection; noisy class-imbalanced data; sampled-noisy vs clean framework; sampled-noisy vs sampled-noisy framework; signal-to-noise; stability pattern; Cancer; Gene expression; Lungs; Noise; Noise measurement; Stability criteria

fLanguage

English

Publisher

IEEE

Conference_Titel

Tools with Artificial Intelligence (ICTAI), 2013 IEEE 25th International Conference on

Conference_Location

Herndon, VA

ISSN

1082-3409

Print_ISBN

978-1-4799-2971-9

Type

conf

DOI

10.1109/ICTAI.2013.134

Filename

6735345