DocumentCode :
1744292
Title :
Controlling overfitting in software quality models: experiments with regression trees and classification
Author :
Khoshgoftaar, Taghi M. ; Allen, Edward B. ; Deng, Jianyu
Author_Institution :
Florida Atlantic Univ., Boca Raton, FL, USA
fYear :
2001
fDate :
2001
Firstpage :
190
Lastpage :
198
Abstract :
In these days of “faster, cheaper, better” release cycles, software developers must focus enhancement efforts on those modules that need improvement the most. Predictions of which modules are likely to have faults during operations are an important tool for guiding such improvement efforts during maintenance. Tree-based models are attractive because they readily model nonmonotonic relationships between a response variable and its predictors. However, tree-based models are vulnerable to overfitting, where the model reflects the structure of the training data set too closely. Even though a model appears to be accurate on training data, if overfitted it may be much less accurate when applied to a current data set. To account for the severe consequences of misclassifying fault-prone modules, our measure of overfitting is based on the expected costs of misclassification, rather than the total number of misclassifications. In this paper, we apply a regression-tree algorithm in the S-Plus system to the classification of software modules, using our classification rule that accounts for the preferred balance between misclassification rates. We conducted a case study of a very large legacy telecommunications system, and investigated two parameters of the regression-tree algorithm. We found that minimum deviance was strongly related to overfitting and can be used to control it, but the effect of minimum node size on overfitting was ambiguous.
Keywords :
pattern classification; program diagnostics; software maintenance; software quality; statistical analysis; subroutines; telecommunication computing; trees (mathematics); S-Plus system; accuracy; case study; classification; fault prediction; fault-prone modules; large legacy telecommunications system; minimum deviance; minimum node size; misclassification cost; misclassification rate; nonmonotonic relationships; overfitting control; program module improvement; regression trees; response variable predictors; software enhancement; software maintenance; software metrics; software quality models; software release cycles; software reliability; training data structure; tree-based models; Application software; Classification tree analysis; Costs; Predictive models; Regression tree analysis; Size control; Software algorithms; Software quality; Telecommunication control; Training data;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Software Metrics Symposium, 2001. METRICS 2001. Proceedings. Seventh International
Conference_Location :
London
ISSN :
1530-1435
Print_ISBN :
0-7695-1043-4
Type :
conf
DOI :
10.1109/METRIC.2001.915528
Filename :
915528