Title :
Evaluating Stratification Alternatives to Improve Software Defect Prediction
Author :
Pelayo, L. ; Dick, Scott
Author_Institution :
Dept. of Electr. & Comput. Eng., Univ. of Alberta, Edmonton, AB, Canada
Date :
1 June 2012
Abstract :
Numerous studies have applied machine learning to the software defect prediction problem, i.e., predicting which modules will experience a failure during operation, based on software metrics. However, skewness in defect-prediction datasets often causes the resulting classifiers to predict the faulty (minority) class less accurately. This problem is well known in machine learning, where it is referred to as "learning from imbalanced datasets." One common approach to mitigating skewness is stratification, which homogenizes the class distribution; however, it is unclear which stratification techniques are most effective, both in general and specifically for software defect prediction. In this article, we investigate the two major stratification alternatives (under-sampling and over-sampling) for software defect prediction using Analysis of Variance. Our analysis covers several modern software defect prediction datasets using a factorial design. We find that the main effect of under-sampling is significant at α = 0.05, as is the interaction between under- and over-sampling. However, the main effect of over-sampling is not significant.
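Code_Sketch :
The abstract does not specify the resampling procedures in detail; the following is a minimal, hypothetical illustration of the two stratification alternatives it compares, assuming simple random resampling of a binary-labeled module list (all names and data here are invented for illustration).

```python
import random

def undersample(majority, minority, rng):
    """Under-sampling: randomly discard majority-class modules
    until both classes are the same size."""
    return rng.sample(majority, len(minority)) + list(minority)

def oversample(majority, minority, rng):
    """Over-sampling: randomly duplicate minority-class modules
    until both classes are the same size."""
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return list(majority) + list(minority) + extra

# Hypothetical skewed dataset: 90 non-faulty (label 0), 10 faulty (label 1).
rng = random.Random(0)
majority = [("module%d" % i, 0) for i in range(90)]
minority = [("faulty%d" % i, 1) for i in range(10)]

under = undersample(majority, minority, rng)  # 20 modules, balanced
over = oversample(majority, minority, rng)    # 180 modules, balanced
```

Under-sampling shrinks the training set (risking information loss), while over-sampling grows it with duplicates (risking overfitting to repeated minority modules); the paper's factorial ANOVA design crosses levels of both factors to measure their main and interaction effects.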
Keywords :
learning (artificial intelligence); software metrics; software quality; statistical analysis; analysis of variance; factorial design; imbalanced datasets; machine learning; software defect prediction; stratification alternatives; stratification techniques; Accuracy; Algorithm design and analysis; Machine learning; Measurement; Object oriented modeling; Software; Learning in imbalanced datasets; non-parametric models; software fault-proneness; software reliability; stratification
Journal_Title :
IEEE Transactions on Reliability
DOI :
10.1109/TR.2012.2183912