• DocumentCode
    2543837
  • Title

    A Lightweight Approach to Uncover Technical Artifacts in Unstructured Data

  • Author

    Bettenburg, Nicolas ; Adams, Bram ; Hassan, Ahmed E. ; Smidt, Michel

  • Author_Institution
    Software Anal. & Intell. Lab., Queen's Univ., Kingston, ON, Canada
  • fYear
    2011
  • fDate
    22-24 June 2011
  • Firstpage
    185
  • Lastpage
    188
  • Abstract
    Developer communication through email, chat, or issue report comments consists largely of unstructured data, i.e., natural language text, mixed with technical artifacts such as project-specific jargon, abbreviations, source code patches, stack traces and identifiers. These technical artifacts represent a valuable source of knowledge on the technical part of the system, with a wide range of applications from establishing traceability links to creating project-specific vocabularies. However, the lack of well-defined boundaries between natural language and technical content makes the automated mining of technical artifacts challenging. As a first step towards a general-purpose technique for extracting technical artifacts from unstructured data, we present a lightweight approach to untangle technical artifacts and natural language text. Our approach is based on existing spell checking tools, which are well-understood, fast, readily available across platforms and impartial to different kinds of textual data. Through a handcrafted benchmark, we demonstrate that our approach is able to successfully uncover a wide range of technical artifacts in unstructured data.
  • Keywords
    data mining; data structures; natural language processing; program diagnostics; software engineering; text analysis; vocabulary; abbreviations; identifiers; natural language text; project-specific jargon; project-specific vocabularies; source code patches; spell checking tools; stack traces; technical artifacts mining; traceability links; unstructured data; Benchmark testing; Conferences; Electronic mail; IEEE Computer Society; Natural languages; Software; language analysis; technical artifacts; text mining
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Program Comprehension (ICPC), 2011 IEEE 19th International Conference on
  • Conference_Location
    Kingston, ON
  • ISSN
    1092-8138
  • Print_ISBN
    978-1-61284-308-7
  • Electronic_ISBN
    1092-8138
  • Type

    conf

  • DOI
    10.1109/ICPC.2011.36
  • Filename
    5970153