Title :
A parallel and efficient approach to large scale clone detection
Author :
Sajnani, Hitesh ; Lopes, Cristiano
Author_Institution :
Donald Bren Sch. of Inf. & Comput. Sci., Univ. of California, Irvine, Irvine, CA, USA
Abstract :
Over the past few years, researchers have implemented various algorithms to improve the scalability of clone detection. Most of these algorithms focus on scaling vertically on a single machine and require complex intermediate data structures (e.g., suffix trees). However, several new use cases of clone detection have emerged that are beyond the computational capacity of a single machine. Moreover, for some of these use cases, it may be expensive to pay the upfront cost of building these data structures. In this paper, we propose a technique to horizontally scale clone detection across multiple machines using the popular MapReduce framework. The technique does not require building any complex intermediate data structures. To increase efficiency, the technique uses a filtering heuristic to prune the number of code block comparisons. The filtering heuristic is independent of our approach and can be used in conjunction with other approaches to increase their efficiency. In our experiments, we found that: (i) the computation time to detect clones decreases by almost half every time we double the number of nodes; and (ii) the scaleup is linear, with a decline of not more than 70% compared to the ideal case, on a cluster of 2-32 nodes for 150-2800 projects.
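The abstract does not spell out the filtering heuristic or the MapReduce job layout, so the following is only a minimal single-process sketch of the general idea, under stated assumptions: code blocks are treated as bags of tokens, similarity is multiset overlap divided by the size of the larger block, and the pruning step is a plain size filter (a stand-in, not the authors' heuristic). The names `map_phase`, `reduce_phase`, and `detect_clones` are hypothetical, and the two phases only imitate the map and reduce steps in memory rather than running on a real cluster.

```python
from collections import Counter, defaultdict
from itertools import combinations


def similarity(a: Counter, b: Counter) -> float:
    """Multiset token overlap divided by the size of the larger block."""
    overlap = sum((a & b).values())
    return overlap / max(sum(a.values()), sum(b.values()))


def map_phase(blocks):
    """Emit (token, block_id) pairs, one per distinct token in each block."""
    for block_id, tokens in blocks.items():
        for token in set(tokens):
            yield token, block_id


def reduce_phase(pairs):
    """Group block ids by token and return candidate pairs sharing a token."""
    groups = defaultdict(set)
    for token, block_id in pairs:
        groups[token].add(block_id)
    candidates = set()
    for ids in groups.values():
        candidates.update(combinations(sorted(ids), 2))
    return candidates


def detect_clones(blocks, threshold=0.8):
    """Return (clone pairs, number of full comparisons actually performed)."""
    bags = {bid: Counter(toks) for bid, toks in blocks.items()}
    sizes = {bid: sum(bag.values()) for bid, bag in bags.items()}
    clones = []
    compared = 0
    for a, b in sorted(reduce_phase(map_phase(blocks))):
        small, large = sorted((sizes[a], sizes[b]))
        # Assumed size filter: overlap <= small, so similarity <= small / large.
        # Pairs that cannot reach the threshold are pruned without comparison.
        if small / large < threshold:
            continue
        compared += 1
        if similarity(bags[a], bags[b]) >= threshold:
            clones.append((a, b))
    return clones, compared
```

In this sketch the filter is safe in the sense that it never discards a pair whose similarity could reach the threshold; it only avoids the full token comparison for pairs whose sizes already rule that out.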
Keywords :
data structures; software maintenance; MapReduce framework; clone detection scalability; complex intermediate data structures; filtering heuristic; horizontally scale clone detection; single machine computational capacity; Availability; Buildings; Cloning; Companies; Data structures; Frequency measurement; Indexes
Conference_Title :
Software Clones (IWSC), 2013 7th International Workshop on
Conference_Location :
San Francisco, CA
DOI :
10.1109/IWSC.2013.6613042