DocumentCode
1952969
Title
A Hidden Markov Model to detect coded information islands in free text
Author
Cerulo, L. ; Ceccarelli, Marco ; Di Penta, Massimiliano ; Canfora, Gerardo
Author_Institution
Dept. of Sci. & Technol, Univ. of Sannio, Benevento, Italy
fYear
2013
fDate
22-23 Sept. 2013
Firstpage
157
Lastpage
166
Abstract
Emails and issue reports capture useful knowledge about development practices, bug fixing, and change activities. Extracting such content is challenging because source code is intermixed with unstructured natural-language text. In this paper we introduce an approach, based on Hidden Markov Models (HMMs), to extract coded information islands, such as source code, stack traces, and patches, from free text at a token level of granularity. We train an HMM for each category of information contained in the text, and adopt the Viterbi algorithm to recognize whether the sequence of tokens (e.g., words, language keywords, numbers, parentheses, punctuation marks) observed in a text switches among those HMMs. Although our implementation focuses on extracting source code from emails, the approach could in principle be easily extended to any text-interleaved language. We evaluated our approach against the state of the art on a set of development emails and bug reports drawn from the software repositories of well-known open source systems. Results indicate an accuracy between 82% and 99%, in line with existing approaches that, unlike ours, require the manual definition of regular expressions or parsers.
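The token-level labeling idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it collapses the per-category HMMs into a single two-state model ("text" vs. "code"), and the token classes, the keyword list, and every probability below are hand-picked assumptions for illustration rather than values trained on data.

```python
import math
import re

# Hypothetical two-state simplification: natural-language text vs. source code.
STATES = ("text", "code")

# Assumed keyword list, for illustration only.
KEYWORDS = {"int", "return", "if", "for", "while", "void"}

# Hand-picked (not trained) probabilities; all values are assumptions.
START = {"text": 0.8, "code": 0.2}
TRANS = {
    "text": {"text": 0.9, "code": 0.1},
    "code": {"text": 0.1, "code": 0.9},
}
EMIT = {
    "text": {"word": 0.85, "keyword": 0.03, "number": 0.04, "punct": 0.08},
    "code": {"word": 0.30, "keyword": 0.20, "number": 0.10, "punct": 0.40},
}


def tokenize(s):
    """Split a string into identifier-like words, integers, and single symbols."""
    return re.findall(r"[A-Za-z_]+|\d+|[^\sA-Za-z_\d]", s)


def classify(token):
    """Map a token to one of the coarse observation classes used by EMIT."""
    if token in KEYWORDS:
        return "keyword"
    if token.isdigit():
        return "number"
    if re.fullmatch(r"[A-Za-z_]+", token):
        return "word"
    return "punct"


def viterbi(observations):
    """Return the most likely state sequence for a list of observation classes."""
    if not observations:
        return []
    # Log-probabilities avoid numeric underflow on long token sequences.
    score = [{s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
              for s in STATES}]
    back = []
    for obs in observations[1:]:
        prev = score[-1]
        cur, ptr = {}, {}
        for s in STATES:
            # Best predecessor state for reaching state s at this position.
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            cur[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][obs])
            ptr[s] = best
        score.append(cur)
        back.append(ptr)
    # Backtrack from the best final state.
    state = max(STATES, key=lambda s: score[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


def label(text):
    """Pair each token of the input with its decoded state."""
    tokens = tokenize(text)
    return list(zip(tokens, viterbi([classify(t) for t in tokens])))
```

Calling `label(...)` on an email body returns one `(token, state)` pair per token; contiguous runs of `"code"` would correspond to candidate islands. In a realistic setting the probabilities would be estimated from labeled examples rather than set by hand.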
Keywords
hidden Markov models; information retrieval; natural language processing; text analysis; HMM; Viterbi algorithm; coded information island extraction; emails; free text; hidden Markov model; issue reports; natural language; open source systems; software repositories; source code; text-interleaved language; unstructured text; Accuracy; Context; Electronic mail; Hidden Markov models; Markov processes; Natural languages; Syntactics; HMM; Mailing list mining; Natural Language Parsing;
fLanguage
English
Publisher
IEEE
Conference_Titel
Source Code Analysis and Manipulation (SCAM), 2013 IEEE 13th International Working Conference on
Conference_Location
Eindhoven
Type
conf
DOI
10.1109/SCAM.2013.6648197
Filename
6648197