Automated Configuration Bug Report Prediction Using Text Mining

Author

Xin Xia ; Lo, Daniel ; Weiwei Qiu ; Xingen Wang ; Bo Zhou

Author_Institution

Coll. of Comput. Sci. & Technol., Zhejiang Univ., Hangzhou, China

fYear

2014

fDate

21-25 July 2014

Firstpage

107

Lastpage

116

Abstract

Configuration bugs are one of the dominant causes of software failures. Previous studies show that a configuration bug could cause huge financial losses in a software system. The importance of configuration bugs has attracted various research studies, e.g., on detecting, diagnosing, and fixing configuration bugs. Given a bug report, an approach that can identify whether the bug is a configuration bug could help developers reduce debugging effort. We refer to this problem as configuration bug report prediction. To address this problem, we develop a new automated framework that applies text mining techniques to the natural-language description of bug reports. The framework trains a statistical model on historical bug reports with known labels (i.e., configuration or non-configuration), and the model is then used to predict a label for a new bug report. Developers could apply our model to automatically predict labels of bug reports to improve their productivity. Our tool first applies feature selection techniques (e.g., information gain and chi-square) to pre-process the textual information in bug reports, and then applies various text mining techniques (e.g., naive Bayes, SVM, and naive Bayes multinomial) to build statistical models. We evaluate our solution on 5 bug report datasets: Accumulo, ActiveMQ, Camel, Flume, and Wicket. We show that naive Bayes multinomial with information gain achieves the best performance. On average across the 5 projects, its accuracy, configuration F-measure, and non-configuration F-measure are 0.811, 0.450, and 0.880, respectively. We also compare our solution with the method proposed by Arshad et al. The results show that our proposed approach, which uses naive Bayes multinomial with information gain, on average improves the accuracy, configuration F-measure, and non-configuration F-measure scores of Arshad et al.'s method by 8.34%, 103.7%, and 4.24%, respectively.
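
The two-stage pipeline described in the abstract (information-gain feature selection followed by a naive Bayes multinomial classifier) can be illustrated with a minimal sketch. This is not the authors' tool: the function names, the toy bug reports, the tokenizer, the add-one smoothing, and the cutoff k=10 are all assumptions made for illustration, and the code uses only the Python standard library.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase a bug report and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def entropy(counts):
    """Shannon entropy (in bits) of a collection of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def information_gain(doc_sets, labels, term):
    """Drop in label entropy from knowing whether `term` occurs in a report."""
    present = Counter(l for d, l in zip(doc_sets, labels) if term in d)
    absent = Counter(l for d, l in zip(doc_sets, labels) if term not in d)
    gain = entropy(Counter(labels).values())
    for part in (present, absent):
        size = sum(part.values())
        if size:
            gain -= size / len(labels) * entropy(part.values())
    return gain

def select_features(docs, labels, k):
    """Keep the k terms with the highest information gain."""
    doc_sets = [set(d) for d in docs]
    vocab = sorted({t for d in docs for t in d})
    ranked = sorted(vocab, key=lambda t: -information_gain(doc_sets, labels, t))
    return set(ranked[:k])

def train_nbm(docs, labels, features):
    """Multinomial naive Bayes with add-one smoothing over selected terms."""
    classes = sorted(set(labels))
    prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    counts = {c: Counter() for c in classes}
    for d, l in zip(docs, labels):
        counts[l].update(t for t in d if t in features)
    loglik = {}
    for c in classes:
        total = sum(counts[c].values()) + len(features)
        loglik[c] = {t: math.log((counts[c][t] + 1) / total) for t in features}
    return prior, loglik

def predict(model, doc):
    """Return the class with the highest posterior log-score for a token list."""
    prior, loglik = model
    scores = {c: prior[c] + sum(loglik[c][t] for t in doc if t in loglik[c])
              for c in prior}
    return max(scores, key=scores.get)

# Toy, made-up bug reports (not taken from the paper's datasets).
train = [
    ("Broker fails to start when the port property in the config file is invalid", "config"),
    ("Wrong default value for timeout setting in properties file", "config"),
    ("Changing the memory limit option in the config file has no effect", "config"),
    ("NullPointerException when sending a message to a closed queue", "non-config"),
    ("Race condition in consumer thread causes duplicate message delivery", "non-config"),
    ("Memory leak when the route is restarted repeatedly", "non-config"),
    ("Incorrect rendering of the page after form submit", "non-config"),
]
docs = [tokenize(text) for text, _ in train]
labels = [label for _, label in train]

features = select_features(docs, labels, k=10)  # k=10 is an arbitrary choice
model = train_nbm(docs, labels, features)

print(predict(model, tokenize("Server ignores the setting in the config file")))
print(predict(model, tokenize("Duplicate message delivered to the queue")))
```

In practice the labels would come from historical bug reports and the number of retained terms would be tuned per project; an off-the-shelf toolkit could replace the hand-written functions shown here.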

Keywords

data mining; program debugging; statistical analysis; text analysis; accumulo; activemq; bug detection; bug diagnosis; camel; configuration F-measure; configuration bug report prediction; debugging effort; feature selection techniques; flume; information gain; naive Bayes multinomial; natural-language description; software failure; statistical model; text mining; wicket; Buildings; Computer bugs; Feature extraction; Predictive models; Support vector machines; Text mining; Training; Configuration Bug; Data Mining; Feature Selection;

fLanguage

English

Publisher

IEEE

Conference_Title

Computer Software and Applications Conference (COMPSAC), 2014 IEEE 38th Annual

Conference_Location

Västerås

Type

conf

DOI

10.1109/COMPSAC.2014.17

Filename

6899207

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=237281