Feature Selection with Imbalanced Data for Software Defect Prediction

Author

Khoshgoftaar, Taghi M. ; Gao, Kehan

Author_Institution

Florida Atlantic Univ., Boca Raton, FL, USA

fYear

2009

fDate

13-15 Dec. 2009

Firstpage

235

Lastpage

240

Abstract

In this paper, we study how data sampling followed by attribute selection affects classification models built from binary-class imbalanced data in the context of software quality engineering. We use a wrapper-based attribute ranking technique to select a subset of attributes, and apply random undersampling (RUS) to the majority class to alleviate the negative effects of imbalanced data on the prediction models. The datasets used in the empirical study were collected from numerous software projects. Five data preprocessing scenarios were explored: (1) training on the original, unaltered fit dataset; (2) training on a sampled version of the fit dataset; (3) training on an unsampled version of the fit dataset using only the attributes chosen by feature selection based on the unsampled fit dataset; (4) training on an unsampled version of the fit dataset using only the attributes chosen by feature selection based on a sampled version of the fit dataset; and (5) training on a sampled version of the fit dataset using only the attributes chosen by feature selection based on the sampled version of the fit dataset. We compared the performance of the classification models constructed under these five scenarios. The results demonstrate that the models built on sampled fit data, with or without feature selection (cases 2 and 5), significantly outperformed the models built on unsampled fit data (cases 1, 3, and 4). Moreover, cases 2 and 5 showed very similar performance, even though case 5 uses only around 15% or 30% of the complete set of attributes used in case 2.
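The five scenarios in the abstract differ only in which version of the fit data is used for feature selection and which is used for training. The sketch below illustrates that structure; it is not the authors' implementation. The 1:1 default `ratio`, the function names, and the `select_features` callback (a stand-in for the wrapper-based ranker, which the abstract does not specify) are all assumptions made for illustration.

```python
import random
from collections import Counter

def random_undersample(rows, labels, ratio=1.0, seed=0):
    """Randomly drop majority-class rows (binary labels only).

    Keeps every minority row and about ratio * n_minority majority rows.
    The balanced default (ratio=1.0) is an assumption, not from the paper.
    """
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    (minority, n_min), (majority, n_maj) = sorted(
        counts.items(), key=lambda kv: kv[1]
    )
    n_keep = min(n_maj, int(round(ratio * n_min)))
    rng = random.Random(seed)
    majority_idx = [i for i, y in enumerate(labels) if y == majority]
    kept = set(rng.sample(majority_idx, n_keep))
    idx = [i for i, y in enumerate(labels) if y == minority or i in kept]
    return [rows[i] for i in idx], [labels[i] for i in idx]

# Which data feature selection sees, and which data the model is trained on,
# for cases (1) to (5) as listed in the abstract.
SCENARIOS = {
    1: {"select_on": None, "train_on": "original"},
    2: {"select_on": None, "train_on": "sampled"},
    3: {"select_on": "original", "train_on": "original"},
    4: {"select_on": "sampled", "train_on": "original"},
    5: {"select_on": "sampled", "train_on": "sampled"},
}

def prepare_fit_data(case, rows, labels, select_features=None, seed=0):
    """Return the (rows, labels) a classifier would be trained on for a case.

    select_features is a hypothetical callback: it takes (rows, labels) and
    returns the column indices to keep.
    """
    cfg = SCENARIOS[case]
    data = {
        "original": (rows, labels),
        "sampled": random_undersample(rows, labels, seed=seed),
    }
    train_rows, train_labels = data[cfg["train_on"]]
    if cfg["select_on"] is None:
        return train_rows, train_labels
    sel_rows, sel_labels = data[cfg["select_on"]]
    cols = select_features(sel_rows, sel_labels)
    return [[r[c] for c in cols] for r in train_rows], train_labels
```

For example, `prepare_fit_data(4, rows, labels, my_ranker)` would choose attributes on the undersampled data but return the full, unsampled fit set restricted to those attributes, where `my_ranker` is whatever selection routine the reader supplies.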

Keywords

fault diagnosis; software fault tolerance; software quality; attribute selection; binary class imbalanced data; classification model; data sampling; feature selection; learning impact; random undersampling technique; software defect prediction; software quality engineering; wrapper-based attribute ranking; Application software; Data engineering; Data mining; Data preprocessing; Machine learning; Predictive models; Project management; Sampling methods; Software measurement; imbalanced data

fLanguage

English

Publisher

IEEE

Conference_Titel

Machine Learning and Applications, 2009. ICMLA '09. International Conference on

Conference_Location

Miami Beach, FL

Print_ISBN

978-0-7695-3926-3

Type

conf

DOI

10.1109/ICMLA.2009.18

Filename

5381844