Regional Information Center for Science and Technology - Controlling overfitting in software quality models: experiments with regression trees and classification

DocumentCode :

1744292

Title :

Controlling overfitting in software quality models: experiments with regression trees and classification

Author :

Khoshgoftaar, Taghi M. ; Allen, Edward B. ; Deng, Jianyu

Author_Institution :

Florida Atlantic Univ., Boca Raton, FL, USA

fYear :

2001

fDate :

2001

Firstpage :

190

Lastpage :

198

Abstract :

In these days of “faster, cheaper, better” release cycles, software developers must focus enhancement efforts on those modules that need improvement the most. Predictions of which modules are likely to have faults during operations are an important tool to guide such improvement efforts during maintenance. Tree-based models are attractive because they readily model nonmonotonic relationships between a response variable and its predictors. However, tree-based models are vulnerable to overfitting, where the model reflects the structure of the training data set too closely. Even though a model appears to be accurate on training data, if overfitted it may be much less accurate when applied to a current data set. To account for the severe consequences of misclassifying fault-prone modules, our measure of overfitting is based on the expected costs of misclassification, rather than the total number of misclassifications. In this paper, we apply a regression-tree algorithm in the S-Plus system to the classification of software modules by the application of our classification rule that accounts for the preferred balance between misclassification rates. We conducted a case study of a very large legacy telecommunications system, and investigated two parameters of the regression-tree algorithm. We found that minimum deviance was strongly related to overfitting and can be used to control it, but the effect of minimum node size on overfitting is ambiguous.
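
The idea of measuring overfitting by expected misclassification cost rather than by a raw error count can be illustrated with a small sketch. This is not the paper's S-Plus implementation: the threshold rule, the function names, and the cost values are all assumptions made for illustration. It assumes a regression tree has already produced a numeric prediction per module, that a module is labeled fault-prone when that prediction reaches a chosen threshold, and that a Type I error (a sound module flagged as fault-prone) is cheaper than a Type II error (a fault-prone module missed).

```python
from typing import List, Sequence, Tuple


def classify(predicted_response: Sequence[float], threshold: float) -> List[bool]:
    """Hypothetical rule: a module is fault-prone (True) when the
    predicted response reaches the threshold."""
    return [p >= threshold for p in predicted_response]


def misclassification_rates(actual: Sequence[bool],
                            predicted: Sequence[bool]) -> Tuple[float, float]:
    """Return (Type I rate, Type II rate).

    Type I: not fault-prone module labeled fault-prone.
    Type II: fault-prone module labeled not fault-prone.
    """
    n_nfp = sum(1 for a in actual if not a)
    n_fp = sum(1 for a in actual if a)
    type1 = sum(1 for a, p in zip(actual, predicted) if not a and p)
    type2 = sum(1 for a, p in zip(actual, predicted) if a and not p)
    rate1 = type1 / n_nfp if n_nfp else 0.0
    rate2 = type2 / n_fp if n_fp else 0.0
    return rate1, rate2


def expected_cost(actual: Sequence[bool], predicted: Sequence[bool],
                  cost_type1: float, cost_type2: float) -> float:
    """Expected cost of misclassification per module: each error rate is
    weighted by its cost and by the observed proportion of its class."""
    rate1, rate2 = misclassification_rates(actual, predicted)
    prior_fp = sum(1 for a in actual if a) / len(actual)
    prior_nfp = 1.0 - prior_fp
    return cost_type1 * rate1 * prior_nfp + cost_type2 * rate2 * prior_fp


def overfitting_gap(train_cost: float, test_cost: float) -> float:
    """Assumed overfitting indicator: how much higher the expected cost
    is on fresh data than on the training data."""
    return test_cost - train_cost
```

In this sketch, moving the threshold trades Type I errors against Type II errors, which is one way to express a preferred balance between the two rates. A large positive gap between the test and training costs would suggest that the tree follows the training data too closely.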

Keywords :

pattern classification; program diagnostics; software maintenance; software quality; statistical analysis; subroutines; telecommunication computing; trees (mathematics); S-Plus system; accuracy; case study; classification; fault prediction; fault-prone modules; large legacy telecommunications system; minimum deviance; minimum node size; misclassification cost; misclassification rate; nonmonotonic relationships; overfitting control; program module improvement; regression trees; response variable predictors; software enhancement; software maintenance; software metrics; software quality models; software release cycles; software reliability; training data structure; tree-based models; Application software; Classification tree analysis; Costs; Predictive models; Regression tree analysis; Size control; Software algorithms; Software quality; Telecommunication control; Training data;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Software Metrics Symposium, 2001. METRICS 2001. Proceedings. Seventh International

Conference_Location :

London

ISSN :

1530-1435

Print_ISBN :

0-7695-1043-4

Type :

conf

DOI :

10.1109/METRIC.2001.915528

Filename :

915528

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1744292