DocumentCode
2543837
Title
A Lightweight Approach to Uncover Technical Artifacts in Unstructured Data
Author
Bettenburg, Nicolas ; Adams, Bram ; Hassan, Ahmed E. ; Smidt, Michel
Author_Institution
Software Anal. & Intell. Lab., Queen's Univ., Kingston, ON, Canada
fYear
2011
fDate
22-24 June 2011
Firstpage
185
Lastpage
188
Abstract
Developer communication through email, chat, or issue report comments consists largely of unstructured data, i.e., natural language text, mixed with technical artifacts such as project-specific jargon, abbreviations, source code patches, stack traces, and identifiers. These technical artifacts represent a valuable source of knowledge on the technical part of the system, with a wide range of applications from establishing traceability links to creating project-specific vocabularies. However, the lack of well-defined boundaries between natural language and technical content makes the automated mining of technical artifacts challenging. As a first step towards a general-purpose technique for extracting technical artifacts from unstructured data, we present a lightweight approach to untangle technical artifacts and natural language text. Our approach is based on existing spell checking tools, which are well-understood, fast, readily available across platforms, and impartial to different kinds of textual data. Through a handcrafted benchmark, we demonstrate that our approach is able to successfully uncover a wide range of technical artifacts in unstructured data.
Keywords
data mining; data structures; natural language processing; program diagnostics; software engineering; text analysis; vocabulary; abbreviations; identifiers; natural language text; project-specific jargon; project-specific vocabularies; source code patches; spell checking tools; stack traces; technical artifacts mining; traceability links; unstructured data; Benchmark testing; Conferences; Data mining; Electronic mail; IEEE Computer Society; Natural languages; Software; language analysis; technical artifacts; text mining; unstructured data;
fLanguage
English
Publisher
IEEE
Conference_Titel
Program Comprehension (ICPC), 2011 IEEE 19th International Conference on
Conference_Location
Kingston, ON
ISSN
1092-8138
Print_ISBN
978-1-61284-308-7
Electronic_ISBN
1092-8138
Type
conf
DOI
10.1109/ICPC.2011.36
Filename
5970153