Title :
Using simulated data sets to compare data analysis techniques used for software cost modelling
Author :
Pickard, L. ; Kitchenham, B. ; Linkman, S.J.
Author_Institution :
Dept. of Comput. Sci., Keele Univ., UK
fDate :
12/1/2001 12:00:00 AM
Abstract :
The goals of the study presented were to compare different data analysis methods and to demonstrate the viability of simulation as a mechanism to allow such comparisons. Simulation was used to create data sets with a known underlying model and with non-Normal characteristics that are frequently found in software data sets: skewness, unstable variance, outliers, and combinations of these characteristics. Three data analysis approaches were investigated: residual analysis, multiple regression, and classification and regression trees (CART). In addition to the standard statistical 'least squares' version of each method, robust and non-parametric versions of the techniques were also investigated. It was found that standard multiple regression techniques were best if the data exhibited only moderate non-Normality. As might be expected, under more extreme conditions such as severe heteroscedasticity, the non-parametric techniques performed best. It was more surprising to find that under strongly non-Normal conditions the robust and non-parametric residual analysis techniques performed as well as the conventional robust and non-parametric versions of multiple regression. However, the most important result of the study is to demonstrate the value of simulation as a technique for evaluating different data analysis techniques under controlled conditions.
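The simulation idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' experimental design: the model coefficients, the gamma-distributed size variable, the noise scale, the outlier mechanism, and all function names are hypothetical choices made for illustration. The Theil-Sen estimator (the median of pairwise slopes) stands in here as one example of a non-parametric alternative to least squares; the abstract does not say which non-parametric methods the study used.

```python
import numpy as np


def simulate(n, slope=2.0, intercept=10.0, outlier_frac=0.05, seed=0):
    """Generate (size, effort) from a known linear model with non-Normal features.

    All parameter values are illustrative assumptions, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    # Skewness: a right-skewed predictor drawn from a gamma distribution.
    size = rng.gamma(shape=2.0, scale=50.0, size=n)
    # Unstable variance: the error standard deviation grows with size.
    noise = rng.normal(0.0, 1.0, size=n) * 0.2 * size
    effort = intercept + slope * size + noise
    # Outliers: inflate a small random fraction of the responses.
    n_out = int(outlier_frac * n)
    idx = rng.choice(n, size=n_out, replace=False)
    effort[idx] *= 3.0
    return size, effort


def ols_slope(x, y):
    """Slope from an ordinary least squares fit of y on x with an intercept."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]


def theil_sen_slope(x, y):
    """Non-parametric slope: the median of all pairwise slopes."""
    i, j = np.triu_indices(len(x), k=1)
    dx = x[j] - x[i]
    keep = dx != 0  # skip pairs with identical x values
    return np.median((y[j][keep] - y[i][keep]) / dx[keep])


if __name__ == "__main__":
    true_slope = 2.0
    x, y = simulate(200, slope=true_slope)
    print("true slope:     ", true_slope)
    print("least squares:  ", ols_slope(x, y))
    print("Theil-Sen:      ", theil_sen_slope(x, y))
```

Because the generating slope is known, each estimate can be compared directly with the true value, and repeating the run over many seeds and over different settings of skewness, variance, and outlier contamination gives a controlled comparison of the kind the abstract describes.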
Keywords :
data analysis; digital simulation; least squares approximations; nonparametric statistics; software cost estimation; software metrics; statistical analysis; CART; classification and regression trees; controlled conditions; data analysis techniques; data sets; multiple regression; non-Normal characteristics; nonparametric residual analysis techniques; nonparametric versions; residual analysis; severe heteroscedasticity; simulated data sets; simulation; software cost modelling; software data sets; standard multiple regression techniques; standard statistical least squares version
Journal_Title :
Software, IEE Proceedings -
DOI :
10.1049/ip-sen:20010621