  • DocumentCode
    3123530
  • Title
    Feature Selection with Imbalanced Data for Software Defect Prediction
  • Author
    Khoshgoftaar, Taghi M.; Gao, Kehan
  • Author_Institution
    Florida Atlantic Univ., Boca Raton, FL, USA
  • fYear
    2009
  • fDate
    13-15 Dec. 2009
  • Firstpage
    235
  • Lastpage
    240
  • Abstract
    In this paper, we study how data sampling followed by attribute selection affects classification models built with binary class-imbalanced data in the context of software quality engineering. We use a wrapper-based attribute ranking technique to select a subset of attributes, and apply random undersampling (RUS) to the majority class to alleviate the negative effects of imbalanced data on the prediction models. The datasets used in the empirical study were collected from numerous software projects. Five data preprocessing scenarios were explored in these experiments: (1) training on the original, unaltered fit dataset, (2) training on a sampled version of the fit dataset, (3) training on an unsampled version of the fit dataset using only the attributes chosen by feature selection based on the unsampled fit dataset, (4) training on an unsampled version of the fit dataset using only the attributes chosen by feature selection based on a sampled version of the fit dataset, and (5) training on a sampled version of the fit dataset using only the attributes chosen by feature selection based on the sampled version of the fit dataset. We compared the performance of the classification models constructed under these five scenarios. The results demonstrate that the classification models constructed on the sampled fit data with or without feature selection (case 2 and case 5) significantly outperformed the classification models built in the other cases (unsampled fit data). Moreover, the two scenarios using sampled data (case 2 and case 5) showed very similar performance, but the subset of attributes (case 5) is only around 15% or 30% of the complete set of attributes (case 2).
  • Keywords
    fault diagnosis; software fault tolerance; software quality; attribute selection; binary class imbalanced data; classification model; data sampling; feature selection; learning impact; random undersampling technique; software defect prediction; software quality engineering; wrapper-based attribute ranking; Application software; Data engineering; Data mining; Data preprocessing; Machine learning; Predictive models; Project management; Sampling methods; Software measurement; imbalanced data
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Machine Learning and Applications, 2009. ICMLA '09. International Conference on
  • Conference_Location
    Miami Beach, FL
  • Print_ISBN
    978-0-7695-3926-3
  • Type
    conf
  • DOI
    10.1109/ICMLA.2009.18
  • Filename
    5381844
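
The five preprocessing scenarios listed in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the undersampling target of a 50:50 class balance is an assumption (the abstract does not state the ratio), and `rank_features` is a simple stand-in filter, not the wrapper-based ranker used in the paper. Both classes are assumed to be present in the data.

```python
import random

def random_undersample(X, y, majority_label=0, seed=0):
    """Randomly drop majority-class rows until the classes are the same size
    (the 50:50 target is an assumption, not taken from the abstract)."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label != majority_label]
    majority = [i for i, label in enumerate(y) if label == majority_label]
    kept = rng.sample(majority, min(len(majority), len(minority)))
    idx = sorted(minority + kept)
    return [X[i] for i in idx], [y[i] for i in idx]

def rank_features(X, y, k):
    """Stand-in ranker (NOT the paper's wrapper-based technique): keep the k
    features with the largest absolute difference between class means."""
    def mean(values):
        return sum(values) / len(values)
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append((abs(mean(pos) - mean(neg)), j))
    return sorted(j for _, j in sorted(scores, reverse=True)[:k])

def project(X, cols):
    """Keep only the selected feature columns."""
    return [[row[j] for j in cols] for row in X]

def build_training_sets(X, y, k, seed=0):
    """Return {case: (X_train, y_train, columns_used)} for cases 1 to 5."""
    Xs, ys = random_undersample(X, y, seed=seed)
    all_cols = list(range(len(X[0])))
    cols_unsampled = rank_features(X, y, k)    # selection on unsampled data
    cols_sampled = rank_features(Xs, ys, k)    # selection on sampled data
    return {
        1: (X, y, all_cols),                              # original fit data
        2: (Xs, ys, all_cols),                            # sampled, all attributes
        3: (project(X, cols_unsampled), y, cols_unsampled),  # unsampled + FS(unsampled)
        4: (project(X, cols_sampled), y, cols_sampled),      # unsampled + FS(sampled)
        5: (project(Xs, cols_sampled), ys, cols_sampled),    # sampled + FS(sampled)
    }
```

Each entry can then be passed to the same learner so that the five cases differ only in preprocessing, which is the comparison the abstract describes.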