DocumentCode :
680762
Title :
Comparison of Two Frameworks for Measuring the Stability of Gene-Selection Techniques on Noisy Class-Imbalanced Data
Author :
Wald, Randall ; Khoshgoftaar, Taghi M. ; Shanab, Ahmad Abu
Author_Institution :
Florida Atlantic Univ., Boca Raton, FL, USA
fYear :
2013
fDate :
4-6 Nov. 2013
Firstpage :
881
Lastpage :
888
Abstract :
A common challenge in feature (gene) selection is the instability of the selected genes. Stability is defined as the degree of agreement between a technique's outputs on differently perturbed versions of the same input data. Very little work considers the impact of noise and sampling (e.g., preprocessing techniques used to cope with class imbalance, such as undersampling and oversampling) on the stability of gene selection techniques. In this study we compare two frameworks for evaluating this stability: "sampled-noisy vs. clean" and "sampled-noisy vs. sampled-noisy." Both frameworks involve noise injection followed by sampling; they differ in that the first compares the features selected from the perturbed datasets (perturbed by noise injection followed by sampling) with the features selected from the original (clean) dataset, while the second performs pairwise comparisons among the results from the perturbed datasets. Intuitively, the first framework should give more consistent results, since it randomizes only one half of the comparison rather than both halves. The primary goal of this paper is to discover whether, despite this difference, the two frameworks show similar patterns and lead to similar conclusions. This is tested using four groups of cancer gene datasets. We employ ten feature rankers from three different families, apply three sampling techniques, and inject artificial class noise to better simulate real-world datasets. The results show that Mutual Information, Signal-To-Noise, and Deviance show the best stability across the two frameworks, while Gain Ratio shows the worst stability on average. The results also show that the two frameworks have the same stability pattern, i.e., the rankers that perform well (or poorly) in the first framework perform as well (or as poorly) in the second. This means that the second framework, which is less computationally intensive (because it does not perform feature selection on the clean data), can be used without requiring the first framework.
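The difference between the two frameworks described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not name the similarity measure used, so the simple overlap fraction below is an assumption, and the function names (`overlap`, `stability_vs_clean`, `stability_pairwise`) and the toy gene lists are hypothetical.

```python
from itertools import combinations
from statistics import mean


def overlap(a, b):
    """Fraction of features shared by two selected subsets.

    Assumed similarity measure for illustration only; the abstract
    does not state which stability metric the paper uses.
    """
    a, b = set(a), set(b)
    return len(a & b) / max(len(a), len(b))


def stability_vs_clean(clean_subset, perturbed_subsets):
    """'Sampled-noisy vs. clean': compare each perturbed result with the
    single result obtained from the original (clean) dataset."""
    return mean(overlap(clean_subset, p) for p in perturbed_subsets)


def stability_pairwise(perturbed_subsets):
    """'Sampled-noisy vs. sampled-noisy': compare every pair of perturbed
    results with each other; the clean dataset is never used."""
    return mean(overlap(p, q) for p, q in combinations(perturbed_subsets, 2))


# Hypothetical top-4 gene lists: one from the clean data, three from
# datasets perturbed by noise injection followed by sampling.
clean = ["g1", "g2", "g3", "g4"]
perturbed = [
    ["g1", "g2", "g3", "g9"],
    ["g1", "g2", "g7", "g8"],
    ["g1", "g3", "g4", "g9"],
]

print(stability_vs_clean(clean, perturbed))
print(stability_pairwise(perturbed))
```

In this sketch the first function needs one extra feature-selection run on the clean data, while the second relies only on the perturbed runs, which mirrors the computational argument made in the abstract.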
Keywords :
bioinformatics; feature selection; learning (artificial intelligence); artificial class noise; cancer gene datasets; gain ratio; gene-selection techniques; machine learning; mutual information; noise injection; noisy class-imbalanced data; sampled-noisy vs clean framework; sampled-noisy vs sampled-noisy framework; signal-to-noise; stability pattern; Cancer; Gene expression; Lungs; Noise; Noise measurement; Stability criteria
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Tools with Artificial Intelligence (ICTAI), 2013 IEEE 25th International Conference on
Conference_Location :
Herndon, VA
ISSN :
1082-3409
Print_ISBN :
978-1-4799-2971-9
Type :
conf
DOI :
10.1109/ICTAI.2013.134
Filename :
6735345