DocumentCode
251919
Title
Automated construction of a software-specific word similarity database
Author
Yuan Tian ; Lo, David ; Lawall, Julia
Author_Institution
Singapore Manage. Univ., Singapore, Singapore
fYear
2014
fDate
3-6 Feb. 2014
Firstpage
44
Lastpage
53
Abstract
Many automated software engineering approaches, including code search, bug report categorization, and duplicate bug report detection, measure similarities between two documents by analyzing their natural language contents. Different words are often used to express the same meaning, so measuring similarity by exact word matching is insufficient. To solve this problem, past studies have shown the need to measure the similarities between pairs of words. To meet this need, the natural language processing community has built WordNet, a manually constructed lexical database that records semantic relations among words and can be used to measure how similar two words are. However, WordNet is a general-purpose resource and often does not contain software-specific words. Also, the meanings of words in WordNet often differ from their meanings in a software engineering context. Thus, there is a need for a software-specific WordNet-like resource that can measure similarities of words. In this work, we propose an automated approach that builds a software-specific WordNet-like resource, named WordSimSEDB, by leveraging the textual contents of posts in StackOverflow. Our approach measures the similarity of words by computing the similarities of the weighted co-occurrences of these words with three types of words in the textual corpus. We have evaluated our approach on a set of software-specific words and compared it with an existing WordNet-based technique (WordNetres) in returning the top-k most similar words. Human judges evaluate the effectiveness of the two techniques. We find that WordNetres returns no result for 55% of the queries. For the remaining queries, WordNetres returns significantly poorer results than our approach.
Keywords
database management systems; natural language processing; program debugging; software engineering; StackOverflow; WordNet; WordSimSEDB; automated construction; automated software engineering; bug report categorization; code search; duplicate bug report detection; lexical database; natural language contents; software-specific word similarity database; Java; Measurement; Semantics; Software; Software engineering; Tuning; Vectors
fLanguage
English
Publisher
IEEE
Conference_Titel
Software Maintenance, Reengineering and Reverse Engineering (CSMR-WCRE), 2014 Software Evolution Week - IEEE Conference on
Conference_Location
Antwerp
Type
conf
DOI
10.1109/CSMR-WCRE.2014.6747213
Filename
6747213