DocumentCode :
652630
Title :
Learning from Open-Source Projects: An Empirical Study on Defect Prediction
Author :
Zhimin He ; Peters, F. ; Menzies, T. ; Ye Yang
Author_Institution :
Lab. for Internet Software Technol., Inst. of Software, Beijing, China
fYear :
2013
fDate :
10-11 Oct. 2013
Firstpage :
45
Lastpage :
54
Abstract :
The fundamental issue in cross-project defect prediction is selecting the most appropriate training data for creating quality defect predictors. Another concern is whether historical data from open-source projects can, in practice, be used to create quality predictors for proprietary projects. Current studies have proposed statistical approaches to finding such training data; however, thus far no apparent effort has been made to study their success on proprietary data. These methods also apply brute-force techniques, which are computationally expensive. In this work, we introduce a novel data selection procedure that takes into account the similarity between the distributions of the test data and the potential training data. Additionally, we use feature subset selection to increase the similarity between the test and training sets. Our procedure provides a comparable and scalable means of solving the cross-project defect prediction problem for creating quality defect predictors. To evaluate our procedure, we conducted empirical studies comparing it with within-company defect prediction and a relevancy filtering method. We found that our proposed method performs relatively better than the filtering method in terms of both computation cost and prediction performance.
Keywords :
learning (artificial intelligence); program debugging; project management; public domain software; statistical analysis; brute force techniques; company defect prediction; computation cost; cross project defect prediction problem; data selection procedure; feature subset selection; open-source project learning; prediction performance; proprietary data; proprietary projects; quality defect predictor creation; relevancy filtering method; statistical approach; test distribution; test-training set similarity; training data; Data models; Filtering; Open source software; Predictive models; Training; Training data; cross-project; data similarity; feature subset selection; instance selection; software defect prediction;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Empirical Software Engineering and Measurement, 2013 ACM / IEEE International Symposium on
Conference_Location :
Baltimore, MD
ISSN :
1938-6451
Print_ISBN :
978-0-7695-5056-5
Type :
conf
DOI :
10.1109/ESEM.2013.20
Filename :
6681337