DocumentCode
570169
Title
A novel dataset-similarity-aware approach for evaluating stability of software metric selection techniques
Author
Wang, Huanjing ; Khoshgoftaar, Taghi M. ; Wald, Randall ; Napolitano, Amri
Author_Institution
Western Kentucky Univ., Bowling Green, KY, USA
fYear
2012
fDate
8-10 Aug. 2012
Firstpage
1
Lastpage
8
Abstract
Software metric (feature) selection is an important pre-processing step before building software defect prediction models. Although much research has analyzed the classification performance of feature selection methods, fewer works have focused on their stability (robustness). Stability is important because feature selection methods that reliably produce the same results despite changes to the data are more trustworthy. Of the papers studying stability, most either compare the features chosen from different random subsamples of the dataset or compare each random subsample with the original dataset. The first approach results in an unknown degree of overlap between the subsamples; the second compares datasets of different sizes. In this work, we propose a fixed-overlap partition algorithm which generates a pair of subsamples with the same number of instances and a specified degree of overlap. We empirically evaluate the stability of 19 feature selection methods in terms of degree of overlap and feature subset size using 16 real software metrics datasets. The consistency index is used as the stability measure, and we show that RF is the most stable filter. The results also show that the degree of overlap and the feature subset size affect the stability of feature selection methods.
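The abstract names two ingredients: a pair of equal-sized subsamples with a specified degree of overlap, and a consistency index for comparing the feature subsets chosen from each. The following is a minimal Python sketch of both ideas, not the paper's implementation. It assumes that the degree of overlap is the fraction of each subsample shared with the other, and that the consistency index is Kuncheva's, I_C(A, B) = (r*n - k^2) / (k*(n - k)), for two subsets of size k sharing r features out of n. The function names `fixed_overlap_pair` and `consistency_index` and all parameters are hypothetical, and class-stratified sampling is not handled.

```python
import random


def fixed_overlap_pair(n_instances, size, overlap, seed=None):
    """Return two index lists of length `size` sharing round(overlap * size) indices.

    Hypothetical sketch: the record does not specify how the paper chooses
    the subsample size or draws the instances.
    """
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must be in [0, 1]")
    shared = round(overlap * size)   # instances present in both subsamples
    unique = size - shared           # instances exclusive to each subsample
    if shared + 2 * unique > n_instances:
        raise ValueError("dataset too small for this size and overlap")
    rng = random.Random(seed)
    idx = list(range(n_instances))
    rng.shuffle(idx)
    common = idx[:shared]
    first = common + idx[shared:shared + unique]
    second = common + idx[shared + unique:shared + 2 * unique]
    return first, second


def consistency_index(subset_a, subset_b, n_features):
    """Kuncheva's consistency index for two equal-sized feature subsets."""
    a, b = set(subset_a), set(subset_b)
    k = len(a)
    if len(b) != k:
        raise ValueError("feature subsets must have the same size")
    if not 0 < k < n_features:
        raise ValueError("subset size must satisfy 0 < k < n_features")
    r = len(a & b)                   # features selected from both subsamples
    return (r * n_features - k * k) / (k * (n_features - k))
```

Under these assumptions, a stability estimate for one method would come from running it on both subsamples, keeping the top k features from each, and averaging `consistency_index` over many pairs drawn at the same overlap.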
Keywords
data mining; software metrics; RF; consistency index; dataset random subsamples; dataset-similarity-aware approach; feature selection method classification performance; feature subset size; fixed-overlap partition algorithm; overlap degree; software defect prediction models; software metric selection techniques; stability evaluation; subsample pair generation; Indexes; Partitioning algorithms; Software; Software metrics; Stability criteria
fLanguage
English
Publisher
IEEE
Conference_Titel
Information Reuse and Integration (IRI), 2012 IEEE 13th International Conference on
Conference_Location
Las Vegas, NV
Print_ISBN
978-1-4673-2282-9
Electronic_ISBN
978-1-4673-2283-6
Type
conf
DOI
10.1109/IRI.2012.6302983
Filename
6302983