DocumentCode :
1455913
Title :
Evaluating Stratification Alternatives to Improve Software Defect Prediction
Author :
Pelayo, L. ; Dick, Scott
Author_Institution :
Dept. of Electr. & Comput. Eng., Univ. of Alberta, Edmonton, AB, Canada
Volume :
61
Issue :
2
fYear :
2012
fDate :
6/1/2012
Firstpage :
516
Lastpage :
525
Abstract :
Numerous studies have applied machine learning to the software defect prediction problem, i.e., predicting which modules will experience a failure during operation based on software metrics. However, skewness in defect-prediction datasets can mean that the resulting classifiers often predict the faulty (minority) class less accurately. This problem is well known in machine learning, and is often referred to as “learning from imbalanced datasets.” One common approach for mitigating skewness is to use stratification to homogenize class distributions; however, it is unclear which stratification techniques are most effective, both generally and specifically in software defect prediction. In this article, we investigate two major stratification alternatives (under- and over-sampling) for software defect prediction using Analysis of Variance. Our analysis covers several modern software defect prediction datasets using a factorial design. We find that the main effect of under-sampling is significant at α = 0.05, as is the interaction between under- and over-sampling. However, the main effect of over-sampling is not significant.
Keywords :
learning (artificial intelligence); software metrics; software quality; statistical analysis; analysis of variance; factorial design; imbalanced datasets; machine learning; software defect prediction; software metrics; stratification alternatives; stratification techniques; Accuracy; Algorithm design and analysis; Analysis of variance; Machine learning; Measurement; Object oriented modeling; Software; Learning in imbalanced datasets; machine learning; non-parametric models; software fault-proneness; software reliability; stratification
fLanguage :
English
Journal_Title :
Reliability, IEEE Transactions on
Publisher :
IEEE
ISSN :
0018-9529
Type :
jour
DOI :
10.1109/TR.2012.2183912
Filename :
6156808