• DocumentCode
    3570904
  • Title
    Improving software quality estimation by combining feature selection strategies with sampled ensemble learning
  • Author
    Khoshgoftaar, Taghi M.; Gao, Kehan; Napolitano, Amri
  • Author_Institution
    Florida Atlantic Univ., Boca Raton, FL, USA
  • fYear
    2014
  • Firstpage
    428
  • Lastpage
    433
  • Abstract
    The efficiency (prediction accuracy) of a classification model is affected by the quality of the training data. High dimensionality and class imbalance are two main problems that can lower the quality of training datasets, making data preprocessing a very important step in any classification task. Feature (software metric) selection and data sampling are frequently used to overcome these problems. Feature selection (FS) is the process of selecting the most important attributes from the original dataset. Data sampling copes with class imbalance by adding instances to, or removing instances from, the training dataset. Another method, boosting (building multiple models, with each model tuned to work better on instances misclassified by previous models), has also been found effective for addressing the class imbalance problem. In this study, we investigate two types of FS approaches: individual FS and repetitive sampled FS. Following feature selection, models are built either with a plain learner or with a boosting algorithm in which random undersampling is integrated with the AdaBoost algorithm. We focus on the impact of the two FS methods (individual FS vs. repetitive sampled FS) and the two model-building processes (boosting vs. plain learner) on software quality prediction. Six feature ranking techniques are examined in the experiment. The results demonstrate that repetitive sampled FS generally outperforms individual FS when a plain learner is used for the subsequent learning process, and that using boosting improves classification performance compared with not using it.
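    A minimal sketch of the two ideas described above: repetitive sampled feature selection followed by either a plain learner or a RUSBoost-style ensemble (random undersampling integrated with AdaBoost). This is not the authors' exact pipeline; the ranker (ANOVA F-score), the number of sampling rounds, the number of selected metrics, and the learners are illustrative assumptions, since the abstract does not name the six ranking techniques used in the paper.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import f_classif
    from sklearn.tree import DecisionTreeClassifier
    from imblearn.under_sampling import RandomUnderSampler
    from imblearn.ensemble import RUSBoostClassifier

    # Imbalanced toy data standing in for a software-metrics dataset
    # (rows = modules, columns = metrics, label = fault-prone or not).
    X, y = make_classification(n_samples=1000, n_features=40, n_informative=8,
                               weights=[0.9, 0.1], random_state=0)

    def repetitive_sampled_fs(X, y, n_rounds=10, k=6, seed=0):
        """Rank features on several balanced (undersampled) copies of the data
        and aggregate the scores, instead of ranking once on the full training
        set (the latter corresponds to individual FS)."""
        rng = np.random.RandomState(seed)
        scores = np.zeros(X.shape[1])
        for _ in range(n_rounds):
            rus = RandomUnderSampler(random_state=rng.randint(1_000_000))
            X_s, y_s = rus.fit_resample(X, y)
            f_scores, _ = f_classif(X_s, y_s)   # ANOVA F-score as the ranker (assumed)
            scores += np.nan_to_num(f_scores)
        return np.argsort(scores)[::-1][:k]     # indices of the top-k metrics

    selected = repetitive_sampled_fs(X, y)
    X_sel = X[:, selected]

    # Model building on the selected metrics: a plain learner vs. a boosted
    # ensemble that undersamples the majority class inside AdaBoost (RUSBoost).
    plain = DecisionTreeClassifier(random_state=0).fit(X_sel, y)
    boosted = RUSBoostClassifier(n_estimators=50, random_state=0).fit(X_sel, y)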
  • Keywords
    data mining; feature selection; learning (artificial intelligence); pattern classification; software quality; AdaBoost algorithm; boosting algorithm; class imbalance problem; classification model; data preprocessing; data sampling; feature ranking techniques; feature selection; model building process; random undersampling; repetitive sampled FS approach; sampled ensemble learning; software quality prediction; subsequent learning process; training datasets; Analysis of variance; Boosting; Data models; Measurement; Radio frequency; Software; Support vector machines; RUSBoost; data sampling; feature selection; repetitive sampled feature selection; software defect prediction;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    2014 IEEE 15th International Conference on Information Reuse and Integration (IRI)
  • Type
    conf
  • DOI
    10.1109/IRI.2014.7051921
  • Filename
    7051921