• DocumentCode
    700370
  • Title

    Threshold-free code clone detection for a large-scale heterogeneous Java repository

  • Author

    Keivanloo, Iman ; Zhang, Feng ; Zou, Ying

  • Author_Institution
    Dept. of Electr. & Comput. Eng., Queen's Univ., Kingston, ON, Canada
  • fYear
    2015
  • fDate
    2-6 March 2015
  • Firstpage
    201
  • Lastpage
    210
  • Abstract
    Code clones are unavoidable entities in software ecosystems. A variety of clone-detection algorithms are available for finding code clones. For Type-3 clone detection at method granularity (i.e., similar methods with changes in statements), the dissimilarity threshold is one of the possible configuration parameters. Existing approaches use a single threshold to detect Type-3 clones across a repository. However, our study shows that detecting Type-3 clones at method granularity in a large-scale heterogeneous repository often requires multiple thresholds. We find that the performance of clone detection improves when different thresholds are selected for different groups of clones in a heterogeneous repository (i.e., various applications). In this paper, we propose a threshold-free approach to detect Type-3 clones at method granularity across a large number of applications. Our approach uses an unsupervised learning algorithm, i.e., k-means, to separate true clones from false clones. We use a clone benchmark with 330,840 tagged clones from 24,824 open source Java projects for our study. We observe that our approach significantly improves performance, by 12% in terms of F-measure. Furthermore, our threshold-free approach eliminates practitioners' concern about possible misconfiguration of Type-3 clone detection tools.
  • Keywords
    Java; public domain software; F-measure; clone benchmark; dissimilarity threshold; heterogeneous Java repository; method granularity; open source Java projects; software ecosystems; threshold-free code clone detection algorithms; type-3 clone detection tools; Benchmark testing; Cloning; Clustering algorithms; Google; Java; Optimization methods; Software systems; clone detection; clone search; clustering; large-scale repository; threshold-free; unsupervised learning;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Software Analysis, Evolution and Reengineering (SANER), 2015 IEEE 22nd International Conference on
  • Conference_Location
    Montreal, QC
  • Type

    conf

  • DOI
    10.1109/SANER.2015.7081830
  • Filename
    7081830
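
  • Illustration
    The abstract says that k-means is used to separate true clones from false clones instead of applying a fixed dissimilarity threshold. The abstract does not give the features, the number of clusters, or the grouping of candidates, so the following is only a minimal sketch under assumptions: each candidate clone pair is reduced to a single dissimilarity score in [0, 1], k is fixed at 2, and the cluster with the lower centroid is treated as the "true clone" cluster. The function names `two_means_1d` and `split_candidates` are hypothetical and do not come from the paper.

```python
from typing import List, Tuple


def two_means_1d(values: List[float], max_iter: int = 100) -> Tuple[float, float]:
    """Run Lloyd's k-means with k=2 on 1-D data; return (low, high) centroids.

    Assumption: centroids are seeded with the minimum and maximum value,
    which is a simple deterministic choice, not the paper's procedure.
    """
    if not values:
        raise ValueError("need at least one value")
    low, high = min(values), max(values)
    for _ in range(max_iter):
        # Assignment step: each value joins the nearer centroid.
        low_members = [v for v in values if abs(v - low) <= abs(v - high)]
        high_members = [v for v in values if abs(v - low) > abs(v - high)]
        # Update step: recompute each centroid as the mean of its members.
        new_low = sum(low_members) / len(low_members) if low_members else low
        new_high = sum(high_members) / len(high_members) if high_members else high
        if new_low == low and new_high == high:
            break  # converged: centroids no longer move
        low, high = new_low, new_high
    return low, high


def split_candidates(dissimilarities: List[float]) -> List[bool]:
    """Flag each candidate pair as a likely true clone (True) or not (False).

    No threshold is supplied by the caller; the boundary between the two
    groups falls out of the clustering of this particular set of scores.
    """
    low, high = two_means_1d(dissimilarities)
    return [abs(d - low) <= abs(d - high) for d in dissimilarities]


# Hypothetical dissimilarity scores for candidate pairs from one application.
scores = [0.05, 0.10, 0.12, 0.15, 0.60, 0.70, 0.75]
flags = split_candidates(scores)
```

    Because the split is recomputed for every group of candidates, two applications with different score distributions end up with different effective boundaries, which is the motivation the abstract gives for avoiding a single repository-wide threshold.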