DocumentCode
680762
Title
Comparison of Two Frameworks for Measuring the Stability of Gene-Selection Techniques on Noisy Class-Imbalanced Data
Author
Wald, Randall ; Khoshgoftaar, Taghi M. ; Shanab, Ahmad Abu
Author_Institution
Florida Atlantic Univ., Boca Raton, FL, USA
fYear
2013
fDate
4-6 Nov. 2013
Firstpage
881
Lastpage
888
Abstract
A common challenge in feature (gene) selection is the instability of the selected genes. Stability is defined as the degree of agreement between a technique's outputs on differently perturbed versions of the same input data. Very little work considers the impact of noise and sampling (e.g., preprocessing techniques used to cope with class imbalance, such as undersampling and oversampling) on the stability of gene selection techniques. In this study we compare two frameworks for evaluating this stability: "sampled-noisy vs. clean" and "sampled-noisy vs. sampled-noisy." Both frameworks involve noise injection followed by sampling. They differ in that the first compares the features selected from the perturbed datasets (noise injection followed by sampling) with the features selected from the original (clean) dataset, while the second performs pairwise comparisons among the results from the perturbed datasets. Intuitively, the first framework should give more consistent results, since it randomizes only one half of each comparison rather than both halves. The primary goal of this paper is to discover whether, despite this difference, the two frameworks show similar patterns and lead to similar conclusions. This is tested using four groups of cancer gene datasets. We employ ten feature rankers from three different families, apply three sampling techniques, and generate artificial class noise to better simulate real-world datasets. The results show that Mutual Information, Signal-To-Noise, and Deviance have the best stability across the two frameworks, while Gain Ratio has the worst stability on average. The results also show that the two frameworks have the same stability pattern, i.e., the rankers that perform well (or poorly) in the first framework perform as well (or as poorly) in the second. This means that the second framework, which is less computationally intensive (because it does not perform feature selection on the clean data), can be used without requiring the first framework.
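The two frameworks described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several parts are assumptions made only for illustration: the Jaccard index as the agreement measure, a simplified signal-to-noise score as the single ranker, uniform label flipping as the class-noise model, random undersampling as the only sampling technique, and all function names and default parameter values.

```python
import itertools
import numpy as np

def snr_top_k(X, y, k):
    """Indices of the k features with the largest (simplified) signal-to-noise score."""
    pos, neg = X[y == 1], X[y == 0]
    spread = pos.std(axis=0) + neg.std(axis=0) + 1e-12  # guard against division by zero
    score = np.abs(pos.mean(axis=0) - neg.mean(axis=0)) / spread
    return set(np.argsort(score)[::-1][:k].tolist())

def inject_class_noise(y, rate, rng):
    """Assumed noise model: flip a fraction `rate` of binary labels chosen at random."""
    y_noisy = y.copy()
    n_flip = int(round(rate * len(y)))
    flip = rng.choice(len(y), size=n_flip, replace=False)
    y_noisy[flip] = 1 - y_noisy[flip]
    return y_noisy

def undersample(X, y, rng):
    """Randomly drop rows of the larger class until both classes have equal size."""
    idx0, idx1 = np.flatnonzero(y == 0), np.flatnonzero(y == 1)
    n = min(len(idx0), len(idx1))
    keep = np.concatenate([rng.choice(idx0, size=n, replace=False),
                           rng.choice(idx1, size=n, replace=False)])
    return X[keep], y[keep]

def jaccard(a, b):
    """Assumed agreement measure between two selected-feature sets."""
    return len(a & b) / len(a | b)

def stability(X, y, k=10, runs=10, noise_rate=0.1, seed=0):
    """Return (sampled-noisy vs. clean, sampled-noisy vs. sampled-noisy) stability."""
    rng = np.random.default_rng(seed)
    clean = snr_top_k(X, y, k)  # feature selection on clean data, used by framework 1 only
    perturbed = []
    for _ in range(runs):
        # Noise injection followed by sampling, as in both frameworks.
        Xs, ys = undersample(X, inject_class_noise(y, noise_rate, rng), rng)
        perturbed.append(snr_top_k(Xs, ys, k))
    # Framework 1: each perturbed subset is compared with the clean subset.
    vs_clean = np.mean([jaccard(p, clean) for p in perturbed])
    # Framework 2: all pairs of perturbed subsets are compared with each other.
    pairwise = np.mean([jaccard(a, b)
                        for a, b in itertools.combinations(perturbed, 2)])
    return float(vs_clean), float(pairwise)
```

Calling `stability(X, y)` on a samples-by-genes matrix `X` with binary labels `y` returns one score per framework. Repeating the call for several rankers would allow a check of whether both frameworks order the rankers the same way, which is the comparison the abstract reports.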
Keywords
bioinformatics; feature selection; learning (artificial intelligence); artificial class noise; cancer gene datasets; gain ratio; gene-selection techniques; machine learning; mutual information; noise injection; noisy class-imbalanced data; sampled-noisy vs clean framework; sampled-noisy vs sampled-noisy framework; signal-to-noise; stability pattern; Cancer; Gene expression; Lungs; Noise; Noise measurement; Stability criteria
fLanguage
English
Publisher
ieee
Conference_Titel
Tools with Artificial Intelligence (ICTAI), 2013 IEEE 25th International Conference on
Conference_Location
Herndon, VA
ISSN
1082-3409
Print_ISBN
978-1-4799-2971-9
Type
conf
DOI
10.1109/ICTAI.2013.134
Filename
6735345