DocumentCode :
2652581
Title :
Impact of Data Sampling on Stability of Feature Selection for Software Measurement Data
Author :
Gao, Kehan ; Khoshgoftaar, Taghi M. ; Napolitano, Antonio
Author_Institution :
Eastern Connecticut State Univ., Willimantic, CT, USA
fYear :
2011
fDate :
7-9 Nov. 2011
Firstpage :
1004
Lastpage :
1011
Abstract :
Software defect prediction can be considered a binary classification problem. Generally, practitioners use historical software data, including metric and fault data collected during the software development process, to build a classification model and then employ this model to classify new program modules as either fault-prone (fp) or not-fault-prone (nfp). Limited project resources can then be allocated according to the prediction results by (for example) assigning more reviews and testing to the modules predicted to be potentially defective. Two challenges often come with the modeling process: (1) high dimensionality of software measurement data and (2) skewed or imbalanced distributions between the two types of modules (fp and nfp) in those datasets. To overcome these problems, extensive studies have been dedicated to improving the quality of training data. The commonly used techniques are feature selection and data sampling. Usually, researchers focus on evaluating classification performance after the training data is modified. The present study assesses a feature selection technique from a different perspective: the stability of the feature selection method, and in particular the impact of data sampling techniques on that stability when feature selection is applied to the sampled data. The findings are based on two case studies performed on datasets from two real-world software projects.
Keywords :
pattern classification; sampling methods; software development management; software fault tolerance; software metrics; binary classification problem; classification performance; data sampling; fault data; fault prone program modules; feature selection stability; metric data; not-fault-prone program modules; real world software projects; software defect prediction; software measurement data; Indexes; Integrated circuits; Software; Software measurement; Stability criteria; data sampling; defect prediction; feature selection; software metrics; stability
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Tools with Artificial Intelligence (ICTAI), 2011 23rd IEEE International Conference on
Conference_Location :
Boca Raton, FL
ISSN :
1082-3409
Print_ISBN :
978-1-4577-2068-0
Electronic_ISBN :
1082-3409
Type :
conf
DOI :
10.1109/ICTAI.2011.172
Filename :
6103463