DocumentCode :
2453215
Title :
Ctcompare: Code clone detection using hashed token sequences
Author :
Toomey, Warren
Author_Institution :
Sch. of IT, Bond Univ., Robina, QLD, Australia
fYear :
2012
fDate :
4-4 June 2012
Firstpage :
92
Lastpage :
93
Abstract :
There is much research on the use of tokenized source code to find code clones both within and between trees of source code. Some approaches have used suffix trees [1], [3]; others have used variations of longest common substring algorithms [4], [5]. This paper outlines an algorithm, embodied in a new tool called ctcompare, that takes a different tokenization approach. Each code base to be compared is first lexically analysed to produce a sequence of tokens. These are then broken into overlapping tuples of N consecutive tokens. The tuples are then hashed and the hash values of token tuples are used to identify type-1 and type-2 clone pairs. Hashed token sequences combined with a database have already been used in earlier ctcompare versions and elsewhere [2], but with a significant performance penalty due to database insertions. The benefits of this approach over the existing research include the simultaneous comparison of multiple large code bases and fast absolute performance.
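The pipeline described in the abstract (tokenize, form overlapping N-token tuples, hash, match) can be illustrated with a minimal Python sketch. This is not ctcompare's implementation: the regex tokenizer, the small keyword set, the `ID`/`NUM` placeholders, Python's built-in `hash`, and the function names `tokenize`, `hashed_tuples` and `find_candidates` are all assumptions made for illustration. Replacing identifiers and literals with placeholders is what allows type-2 clones (renamed identifiers or changed literals) to match as well as type-1 clones.

```python
import re
from collections import defaultdict

# Hypothetical toy lexer: identifiers/keywords, integer literals, or any
# single non-space character. A real tool would use a per-language lexer.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")
KEYWORDS = {"if", "else", "for", "while", "return", "int"}  # illustrative only


def tokenize(source, normalize=True):
    """Turn source text into a token list.

    With normalize=True, identifiers become "ID" and numbers become "NUM",
    so code that differs only in names or literals yields the same tokens.
    """
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if normalize and (tok[0].isalpha() or tok[0] == "_") and tok not in KEYWORDS:
            tokens.append("ID")
        elif normalize and tok[0].isdigit():
            tokens.append("NUM")
        else:
            tokens.append(tok)
    return tokens


def hashed_tuples(tokens, n):
    """Return (hash, start_index) for every overlapping run of n tokens."""
    return [(hash(tuple(tokens[i:i + n])), i) for i in range(len(tokens) - n + 1)]


def find_candidates(code_bases, n=5):
    """Group tuple locations by hash across all code bases.

    code_bases maps a name to its source text. Any hash seen at more than
    one location is a clone candidate; because different tuples can collide,
    candidates still need a token-by-token check before being reported.
    """
    index = defaultdict(list)
    for name, source in code_bases.items():
        for h, pos in hashed_tuples(tokenize(source), n):
            index[h].append((name, pos))
    return [locs for locs in index.values() if len(locs) > 1]
```

Because every code base is indexed into the same in-memory table, several code bases can be compared in one pass. That mirrors the simultaneous comparison the abstract mentions, although the actual tool's data structures may differ.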
Keywords :
cryptography; source coding; trees (mathematics); code clone detection; ctcompare; hashed token sequences; suffix trees; tokenized source code; Algorithm design and analysis; Australia; Cloning; Databases; Educational institutions; Redundancy; Time measurement; clone detection; code clone; code redundancy; hash function; software
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Software Clones (IWSC), 2012 6th International Workshop on
Conference_Location :
Zurich
Print_ISBN :
978-1-4673-1794-8
Type :
conf
DOI :
10.1109/IWSC.2012.6227881
Filename :
6227881