DocumentCode :
589273
Title :
An Empirical Study on the Stability of Feature Selection for Imbalanced Software Engineering Data
Author :
Wang, Huanjing ; Khoshgoftaar, Taghi M. ; Napolitano, Antonio
Volume :
1
fYear :
2012
fDate :
12-15 Dec. 2012
Firstpage :
317
Lastpage :
323
Abstract :
In software quality modeling, software metrics are collected during the software development cycle. However, not all metrics are relevant to the class attribute (software quality). Metric (feature) selection has become the cornerstone of many software quality classification problems. Selecting software metrics that are important for software quality classification is a necessary and critical step before the model training process. Recently, the robustness (e.g., stability) of feature selection techniques has been studied to examine how sensitive these techniques are to changes in the dataset (adding or removing program modules). This work provides an empirical study of the stability of feature selection techniques across six software metrics datasets with varying levels of class balance. Eighteen feature selection techniques are evaluated. Three factors (feature subset size, degree of perturbation, and class balance of the datasets) are considered in evaluating the stability of these techniques. Experimental results show that these factors affect the stability of feature selection techniques as one might expect. We found that, with few exceptions, feature rankings based on highly imbalanced datasets are less stable than those based on slightly imbalanced data. Results also show that making smaller changes to the datasets has less impact on the stability of feature ranking techniques. Overall, we conclude that a careful understanding of one's dataset (and certain choices of metric selection technique) can help practitioners build more reliable software quality models.
Keywords :
pattern classification; software metrics; software quality; feature ranking; feature selection stability; imbalanced software engineering data; model training process; software development cycle; software quality classification problems; software quality modeling; Indexes; Measurement; Radio frequency; Stability criteria; imbalanced data; stability; subsample
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Machine Learning and Applications (ICMLA), 2012 11th International Conference on
Conference_Location :
Boca Raton, FL
Print_ISBN :
978-1-4673-4651-1
Type :
conf
DOI :
10.1109/ICMLA.2012.60
Filename :
6406682