Title :
Query by example in large-scale code repositories
Author :
Vipin Balachandran
Author_Institution :
VMware, Bangalore, India
Abstract :
Searching for code samples in a code repository is an important part of program comprehension. Most existing code search tools support syntactic element search and regular expression pattern search. However, they are text-based and hence cannot handle queries that are syntactic patterns. Existing solutions for querying syntactic patterns rely on specialized query languages, which present a steep learning curve for users. Querying would be more user-friendly if the syntactic pattern could be formulated in the underlying programming language (as a sample code snippet) instead of a specialized query language. In this paper, we propose a solution to the query-by-example problem using Abstract Syntax Tree (AST) structural similarity matching. The query snippet is converted to an AST, its subtrees are compared against the AST subtrees of source files in the repository, and the similarity values of matching subtrees are aggregated into a relevance score for each source file. To scale this approach to large code repositories, we use locality-sensitive hash functions and numerical vector approximations of trees. Our experimental evaluation involves running control queries against a real project. The results show that our algorithm can achieve high precision (0.73) and recall (0.81) and scale to large code repositories without compromising quality.
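The pipeline outlined in the abstract (snippet to AST, subtree comparison, aggregation into a per-file relevance score) can be illustrated with a minimal sketch. This is not the paper's implementation: it uses Python's standard `ast` module rather than Java, approximates each subtree by a vector of node-type counts, compares vectors by brute-force Euclidean distance where the paper would use locality-sensitive hashing, and the function names, the `1 / (1 + d)` similarity formula, and the `threshold` and `min_nodes` parameters are all assumptions made for illustration.

```python
import ast
import math
from collections import Counter


def subtree_vectors(source, min_nodes=3):
    """Return one node-type count vector per AST subtree of `source`."""
    vectors = []
    for node in ast.walk(ast.parse(source)):
        # Numerical approximation of the subtree rooted at `node`.
        counts = Counter(type(n).__name__ for n in ast.walk(node))
        # Skip tiny subtrees (bare names, context markers); assumed cutoff.
        if sum(counts.values()) >= min_nodes:
            vectors.append(counts)
    return vectors


def distance(u, v):
    """Euclidean distance between two count vectors."""
    keys = set(u) | set(v)
    return math.sqrt(sum((u[k] - v[k]) ** 2 for k in keys))


def similarity(u, v):
    """Map distance to (0, 1]; this formula is an illustrative assumption."""
    return 1.0 / (1.0 + distance(u, v))


def relevance(query, source, threshold=0.5):
    """Sum the best-match similarity of each query subtree found in `source`."""
    file_vectors = subtree_vectors(source)
    score = 0.0
    for q in subtree_vectors(query):
        # Brute-force scan; stands in for the hashing step at scale.
        best = max((similarity(q, f) for f in file_vectors), default=0.0)
        if best >= threshold:
            score += best
    return score
```

Under these assumptions, a file containing a loop shaped like the query snippet receives a higher relevance score than a file with no structurally similar subtree, regardless of the identifier names used.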
Keywords :
"Vegetation","Syntactics","Euclidean distance","Approximation methods","Search engines","Java","Approximation algorithms"
Conference_Title :
Software Maintenance and Evolution (ICSME), 2015 IEEE International Conference on
DOI :
10.1109/ICSM.2015.7332498