Title :
Part-of-speech tagging of program identifiers for improved text-based software engineering tools
Author :
Gupta, Swastik ; Malik, S. ; Pollock, Lori ; Vijay-Shanker, K.
Author_Institution :
Comput. & Inf. Sci., Univ. of Delaware, Newark, DE, USA
Abstract :
To aid program comprehension, programmers choose identifiers for methods, classes, fields and other program elements primarily by following naming conventions in software. These software “naming conventions” follow systematic patterns which can convey deep natural language clues that can be leveraged by software engineering tools. For example, they can be used to increase the accuracy of software search tools, improve the ability of program navigation tools to recommend related methods, and raise the accuracy of other program analyses. After splitting multi-word names into their component words, the next step in extracting accurate natural language information is tagging each word with its part of speech (POS) and then chunking the name into natural language phrases. State-of-the-art approaches, most of which rely on “traditional POS taggers” trained on natural language documents, do not capture the syntactic structure of program elements. In this paper, we present a POS tagger and syntactic chunker for source code names that takes into account programmers' naming conventions to understand the regular, systematic ways a program element is named. We studied the naming conventions used in object-oriented programming and identified different grammatical constructions that characterize a large number of program identifiers. This study then informed the design of our POS tagger and chunker. Our evaluation results show a significant improvement in accuracy (11%-20%) of POS tagging of identifiers over current approaches. With this improved accuracy, both automated software engineering tools and developers will be able to better capture and understand the information available in code.
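The pipeline the abstract describes (split a multi-word name, tag each word with a part of speech, then chunk the tagged words into phrases) can be illustrated with a minimal Python sketch. This is not the authors' tagger or chunker: the tiny word lists, the tag labels `V`/`P`/`N`, the function names, and the rule that a leading known verb in a method name is tagged as a verb are all assumptions made here purely for illustration.

```python
import re

# Hypothetical toy lexicons (assumptions, not taken from the paper).
VERBS = {"get", "set", "is", "has", "add", "remove", "parse", "create", "find", "convert"}
PREPOSITIONS = {"to", "from", "of", "in", "by", "for", "with", "on"}


def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase words."""
    words = []
    for part in re.split(r"[_\W]+", name):
        # Acronym runs (XML), capitalized or lowercase words, and digit runs.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [w.lower() for w in words]


def tag_method_name(name):
    """Tag each word of a method name with a coarse POS label.

    Assumed convention: a known verb in first position is a verb (V),
    known prepositions are P, and everything else defaults to noun (N).
    """
    tagged = []
    for i, word in enumerate(split_identifier(name)):
        if i == 0 and word in VERBS:
            tagged.append((word, "V"))
        elif word in PREPOSITIONS:
            tagged.append((word, "P"))
        else:
            tagged.append((word, "N"))
    return tagged


def chunk(tagged):
    """Group consecutive nouns into noun phrases (NP); other tags stand alone."""
    chunks = []
    for word, tag in tagged:
        if tag == "N" and chunks and chunks[-1][0] == "NP":
            chunks[-1][1].append(word)
        elif tag == "N":
            chunks.append(("NP", [word]))
        else:
            chunks.append((tag, [word]))
    return chunks


if __name__ == "__main__":
    for identifier in ("getFileName", "convertStringToInt", "parseXMLDocument"):
        print(identifier, chunk(tag_method_name(identifier)))
```

A position-based rule like the one in `tag_method_name` hints at why naming conventions matter: a tagger trained on ordinary prose has no reason to expect a phrase to open with an imperative verb followed by a bare noun phrase, whereas method names do this routinely.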
Keywords :
natural language processing; object-oriented programming; reverse engineering; software engineering; text analysis; POS tagger; class identifier; component words; deep natural language clues; field identifier; grammatical construction; method identifier; multiword name splitting; natural language documents; natural language information extraction; natural language phrases; object oriented programming; part of speech; part-of-speech tagging; program analysis; program comprehension; program elements; program identifiers; program navigation tools; programmer naming convention; software naming conventions; software search tools; source code names; syntactic chunker; syntactic structure; text-based software engineering tools; word tagging; Accuracy; Context; Natural languages; Software; Software engineering; Syntactics; Tagging; Program understanding; comprehension; identifiers; natural language processing; part-of-speech;
Conference_Title :
Program Comprehension (ICPC), 2013 IEEE 21st International Conference on
Conference_Location :
San Francisco, CA
DOI :
10.1109/ICPC.2013.6613828