• DocumentCode
    1905467
  • Title
    Author Identification in Imbalanced Sets of Source Code Samples
  • Author
    Chatzicharalampous, E. ; Frantzeskou, G. ; Stamatatos, E.
  • Author_Institution
    Dept. of Inf. & Commun. Syst. Eng., Univ. of the Aegean, Karlovassi, Greece
  • Volume
    1
  • fYear
    2012
  • fDate
    7-9 Nov. 2012
  • Firstpage
    790
  • Lastpage
    797
  • Abstract
    Like natural language texts, source code documents can be distinguished by their style. Source code author identification can be viewed as a text classification task, provided that samples of known authorship from a set of candidate authors are available. Although very promising results have been reported for this task, evaluations of existing approaches have not focused on the class imbalance problem and its effect on performance. In this paper, we present a systematic experimental study of author identification in skewed training sets, where the training samples are unequally distributed over the candidate authors. Two representative author identification methods are examined: one follows the profile-based paradigm (where a single representation is produced from all the available training samples per author), and the other follows the instance-based paradigm (where each training sample has its own individual representation). We examine the effect of the source code representation on the performance of these methods and show that the profile-based method is better able to handle highly skewed training sets, while the instance-based method is a better choice for balanced or slightly skewed training sets.
  • Keywords
    natural language processing; pattern classification; source coding; text analysis; class imbalance problem; imbalanced source code sample sets; instance-based paradigm; natural language texts; profile-based paradigm; skewed training sets; source code author identification; source code documents; text classification task; Forensics; Measurement; Natural languages; Software; Support vector machines; Text categorization; Training; Source code author identification; byte-level n-grams; class imbalance
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Tools with Artificial Intelligence (ICTAI), 2012 IEEE 24th International Conference on
  • Conference_Location
    Athens
  • ISSN
    1082-3409
  • Print_ISBN
    978-1-4799-0227-9
  • Type
    conf
  • DOI
    10.1109/ICTAI.2012.112
  • Filename
    6495124
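
The two paradigms contrasted in the Abstract field can be illustrated with a minimal Python sketch. It is not the paper's implementation: the byte-level n-gram representation is taken from the keywords, but the n-gram length, the profile size, the set-intersection similarity, and the nearest-neighbour decision rule used for the instance-based case are all assumptions chosen for brevity, and every function name is hypothetical.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int = 3) -> Counter:
    """Count the overlapping byte-level n-grams of one source code sample."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def top_ngrams(counts: Counter, size: int) -> set:
    """Keep the `size` most frequent n-grams as a set-valued representation."""
    return {gram for gram, _ in counts.most_common(size)}


def profile_based(train: dict, unknown: bytes, n: int = 3, size: int = 500) -> str:
    """Profile-based: one representation per author, pooled over all samples.

    Similarity is the number of shared n-grams (an assumed measure).
    """
    query = top_ngrams(byte_ngrams(unknown, n), size)
    scores = {}
    for author, samples in train.items():
        pooled = Counter()
        for sample in samples:
            pooled.update(byte_ngrams(sample, n))
        scores[author] = len(query & top_ngrams(pooled, size))
    return max(scores, key=scores.get)


def instance_based(train: dict, unknown: bytes, n: int = 3, size: int = 500) -> str:
    """Instance-based: one representation per training sample.

    The single most similar sample decides the author (nearest neighbour,
    used here as an assumed stand-in for a trained classifier).
    """
    query = top_ngrams(byte_ngrams(unknown, n), size)
    best_author, best_score = None, -1
    for author, samples in train.items():
        for sample in samples:
            score = len(query & top_ngrams(byte_ngrams(sample, n), size))
            if score > best_score:
                best_author, best_score = author, score
    return best_author
```

Both functions take `train` as a mapping from an author name to a list of that author's samples as `bytes`. An imbalanced set is simply one in which those lists differ greatly in length: pooling merges however many samples an author has into a single profile, whereas the instance-based variant represents each sample separately.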