• DocumentCode
    3570904
  • Title
    Improving software quality estimation by combining feature selection strategies with sampled ensemble learning
  • Author
    Khoshgoftaar, Taghi M.; Gao, Kehan; Napolitano, Amri
  • Author_Institution
    Florida Atlantic Univ., Boca Raton, FL, USA
  • fYear
    2014
  • Firstpage
    428
  • Lastpage
    433
  • Abstract
    The efficiency (prediction accuracy) of a classification model is affected by the quality of the training data. High dimensionality and class imbalance are two main problems that can lower the quality of training datasets, making data preprocessing a very important step in any classification task. Feature (software metric) selection and data sampling are frequently used to overcome these problems. Feature selection (FS) is the process of selecting the most important attributes from the original dataset. Data sampling copes with class imbalance by adding instances to, or removing instances from, the training dataset. Another method, boosting (building multiple models, with each model tuned to work better on instances misclassified by previous models), has also been found effective for addressing the class imbalance problem. In this study, we investigate two types of FS approaches: individual FS and repetitive sampled FS. Following feature selection, models are built either with a plain learner or with a boosting algorithm in which random undersampling is integrated with the AdaBoost algorithm. We focus on the impact of the two FS methods (individual FS vs. repetitive sampled FS) and the two model-building processes (boosting vs. plain learner) on software quality prediction. Six feature ranking techniques are examined in the experiment. The results demonstrate that repetitive sampled FS generally outperforms individual FS when a plain learner is used for the subsequent learning process, and that using boosting improves classification performance compared with not using it.
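    A minimal sketch of the two ideas described above: repetitive sampled feature selection followed by either a plain learner or a RUSBoost-style ensemble (random undersampling integrated with AdaBoost). This is not the authors' exact pipeline; the ranker (ANOVA F-score), the number of sampling rounds, the number of selected metrics, and the learners are illustrative assumptions, since the abstract does not name the six ranking techniques used in the paper.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import f_classif
    from sklearn.tree import DecisionTreeClassifier
    from imblearn.under_sampling import RandomUnderSampler
    from imblearn.ensemble import RUSBoostClassifier

    # Imbalanced toy data standing in for a software-metrics dataset
    # (rows = modules, columns = metrics, label = fault-prone or not).
    X, y = make_classification(n_samples=1000, n_features=40, n_informative=8,
                               weights=[0.9, 0.1], random_state=0)

    def repetitive_sampled_fs(X, y, n_rounds=10, k=6, seed=0):
        """Rank features on several balanced (undersampled) copies of the data
        and aggregate the scores, instead of ranking once on the full training
        set (the latter corresponds to individual FS)."""
        rng = np.random.RandomState(seed)
        scores = np.zeros(X.shape[1])
        for _ in range(n_rounds):
            rus = RandomUnderSampler(random_state=rng.randint(1_000_000))
            X_s, y_s = rus.fit_resample(X, y)
            f_scores, _ = f_classif(X_s, y_s)   # ANOVA F-score as the ranker (assumed)
            scores += np.nan_to_num(f_scores)
        return np.argsort(scores)[::-1][:k]     # indices of the top-k metrics

    selected = repetitive_sampled_fs(X, y)
    X_sel = X[:, selected]

    # Model building on the selected metrics: a plain learner vs. a boosted
    # ensemble that undersamples the majority class inside AdaBoost (RUSBoost).
    plain = DecisionTreeClassifier(random_state=0).fit(X_sel, y)
    boosted = RUSBoostClassifier(n_estimators=50, random_state=0).fit(X_sel, y)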
  • Keywords
    data mining; feature selection; learning (artificial intelligence); pattern classification; software quality; AdaBoost algorithm; boosting algorithm; class imbalance problem; classification model; data preprocessing; data sampling; feature ranking techniques; feature selection; model building process; random undersampling; repetitive sampled FS approach; sampled ensemble learning; software quality prediction; subsequent learning process; training datasets; Analysis of variance; Boosting; Data models; Measurement; Radio frequency; Software; Support vector machines; RUSBoost; data sampling; feature selection; repetitive sampled feature selection; software defect prediction;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    2014 IEEE 15th International Conference on Information Reuse and Integration (IRI)
  • Type
    conf
  • DOI
    10.1109/IRI.2014.7051921
  • Filename
    7051921