DocumentCode :
1952969
Title :
A Hidden Markov Model to detect coded information islands in free text
Author :
Cerulo, L. ; Ceccarelli, Marco ; Di Penta, Massimiliano ; Canfora, Gerardo
Author_Institution :
Dept. of Sci. & Technol., Univ. of Sannio, Benevento, Italy
fYear :
2013
fDate :
22-23 Sept. 2013
Firstpage :
157
Lastpage :
166
Abstract :
Emails and issue reports capture useful knowledge about development practices, bug fixing, and change activities. Extracting such content is challenging because source code is mixed with unstructured natural-language text. In this paper we introduce an approach, based on Hidden Markov Models (HMMs), to extract coded information islands, such as source code, stack traces, and patches, from free text at a token level of granularity. We train an HMM for each category of information contained in the text, and adopt the Viterbi algorithm to recognize whether the sequence of tokens observed in a text (e.g., words, language keywords, numbers, parentheses, punctuation marks) switches among those HMMs. Although our implementation focuses on extracting source code from emails, the approach could in principle be extended to any text-interleaved language. We evaluated our approach against the state of the art on a set of development emails and bug reports drawn from the software repositories of well-known open source systems. Results indicate an accuracy between 82% and 99%, in line with existing approaches that, unlike ours, require the manual definition of regular expressions or parsers.
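The token-level decoding described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the paper trains one HMM per information category, whereas this sketch collapses each category into a single hidden state (TEXT or CODE). The token classes, the keyword list, and every probability below are hand-picked assumptions for illustration, not values learned from data or taken from the paper.

```python
import math
import re

# Hypothetical two-state model: one hidden state per information category.
STATES = ("TEXT", "CODE")

# Illustrative keyword list (an assumption, not taken from the paper).
KEYWORDS = {"int", "return", "if", "for", "while", "void", "public", "class"}

# Hand-picked probabilities, for illustration only; a real model would
# estimate them from labelled emails or bug reports.
START = {"TEXT": 0.8, "CODE": 0.2}
TRANS = {
    "TEXT": {"TEXT": 0.9, "CODE": 0.1},
    "CODE": {"TEXT": 0.1, "CODE": 0.9},
}
EMIT = {
    "TEXT": {"word": 0.80, "keyword": 0.02, "number": 0.03, "punct": 0.15},
    "CODE": {"word": 0.30, "keyword": 0.20, "number": 0.10, "punct": 0.40},
}


def tokenize(text):
    """Split text into identifiers/words, integers, and single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", text)


def token_class(token):
    """Map a token to one of the four observation symbols."""
    if token in KEYWORDS:
        return "keyword"
    if token.isdigit():
        return "number"
    if token[0].isalpha() or token[0] == "_":
        return "word"
    return "punct"


def viterbi(observations):
    """Return the most likely state sequence, computed in log space."""
    if not observations:
        return []
    scores = {
        s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
        for s in STATES
    }
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for landing in state s at this step.
            best_prev = max(
                STATES, key=lambda p: scores[p] + math.log(TRANS[p][s])
            )
            new_scores[s] = (
                scores[best_prev]
                + math.log(TRANS[best_prev][s])
                + math.log(EMIT[s][obs])
            )
            pointers[s] = best_prev
        backpointers.append(pointers)
        scores = new_scores
    # Trace the best path backwards from the best final state.
    state = max(STATES, key=scores.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def label_tokens(text):
    """Pair each token of the text with its decoded TEXT/CODE label."""
    tokens = tokenize(text)
    labels = viterbi([token_class(t) for t in tokens])
    return list(zip(tokens, labels))


if __name__ == "__main__":
    for token, label in label_tokens(
        "Please see the patch below int x = 0 ; return x ;"
    ):
        print(f"{label}\t{token}")
```

The high self-transition probabilities are what make the decoder favour contiguous islands: a single identifier inside a code fragment stays labelled CODE because switching to TEXT and back costs more than the emission probability it would gain.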
Keywords :
hidden Markov models; information retrieval; natural language processing; text analysis; HMM; Viterbi algorithm; coded information island extraction; emails; free text; hidden Markov model; issue reports; natural language; open source systems; software repositories; source code; text-interleaved language; unstructured text; Accuracy; Context; Electronic mail; Hidden Markov models; Markov processes; Natural languages; Syntactics; HMM; Mailing list mining; Natural Language Parsing;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Source Code Analysis and Manipulation (SCAM), 2013 IEEE 13th International Working Conference on
Conference_Location :
Eindhoven
Type :
conf
DOI :
10.1109/SCAM.2013.6648197
Filename :
6648197