Author Identification in Imbalanced Sets of Source Code Samples

Author

Chatzicharalampous, E. ; Frantzeskou, G. ; Stamatatos, E.

Author_Institution

Dept. of Inf. & Commun. Syst. Eng., Univ. of the Aegean, Karlovassi, Greece

Volume

1

fYear

2012

fDate

7-9 Nov. 2012

Firstpage

790

Lastpage

797

Abstract

Like natural language texts, source code documents can be distinguished by their style. Source code author identification can be viewed as a text classification task, provided that samples of known authorship are available for a set of candidate authors. Although very promising results have been reported for this task, evaluations of existing approaches have not focused on the class imbalance problem and its effect on performance. In this paper, we present a systematic experimental study of author identification in skewed training sets, where the training samples are unequally distributed over the candidate authors. Two representative author identification methods are examined: one follows the profile-based paradigm (where a single representation is produced from all the available training samples per author) and the other follows the instance-based paradigm (where each training sample has its own individual representation). We examine the effect of the source code representation on the performance of these methods and show that the profile-based method is better able to handle highly skewed training sets, while the instance-based method is a better choice for balanced or slightly skewed training sets.
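
The difference between the two paradigms named in the abstract can be illustrated with a minimal sketch over byte-level n-grams (listed in the keywords). Everything below is an assumption made for illustration and is not taken from the paper: the function names, the trigram length, the profile size `top`, the profile-intersection score, and the cosine nearest-neighbour rule used as a stand-in instance-based classifier.

```python
from collections import Counter
import math


def byte_ngrams(text, n=3):
    """Count byte-level n-grams of a source code string."""
    data = text.encode("utf-8")
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def build_profiles(samples, n=3, top=500):
    """Profile-based: merge ALL samples of an author into one representation.

    Assumption: a profile is the set of the `top` most frequent n-grams.
    """
    merged = {}
    for author, text in samples:
        merged.setdefault(author, Counter()).update(byte_ngrams(text, n))
    return {a: {g for g, _ in c.most_common(top)} for a, c in merged.items()}


def predict_profile(profiles, text, n=3, top=500):
    """Pick the author whose profile shares the most n-grams with the query."""
    query = {g for g, _ in byte_ngrams(text, n).most_common(top)}
    return max(profiles, key=lambda a: len(profiles[a] & query))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(v * b[g] for g, v in a.items() if g in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_instance(samples, text, n=3):
    """Instance-based: every training sample keeps its own representation.

    Assumption: 1-nearest-neighbour by cosine similarity stands in for
    whatever classifier is trained on the individual samples.
    """
    query = byte_ngrams(text, n)
    author, _ = max(
        ((a, cosine(byte_ngrams(t, n), query)) for a, t in samples),
        key=lambda pair: pair[1],
    )
    return author


# Hypothetical, deliberately skewed training set: three samples vs. one.
samples = [
    ("alice", "for (int i = 0; i < n; i++) { total += values[i]; }"),
    ("alice", "for (int j = 0; j < m; j++) { count += items[j]; }"),
    ("alice", "for (int k = 0; k < len; k++) { sum += data[k]; }"),
    ("bob", "while(idx<size){acc=acc+arr[idx];idx=idx+1;}"),
]
query = "while(pos<limit){res=res+buf[pos];pos=pos+1;}"

profiles = build_profiles(samples)
print(predict_profile(profiles, query))
print(predict_instance(samples, query))
```

In the profile-based variant the number of training samples per author only affects how much text feeds a single profile, whereas in the instance-based variant each extra sample adds another training point for that author, which is where class imbalance can enter.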

Keywords

natural language processing; pattern classification; source coding; text analysis; class imbalance problem; imbalanced source code sample sets; instance-based paradigm; natural language texts; profile-based paradigm; skewed training sets; source code author identification; source code documents; text classification task; Forensics; Measurement; Natural languages; Software; Support vector machines; Text categorization; Training; Source code author identification; byte-level n-grams; class imbalance;

fLanguage

English

Publisher

IEEE

Conference_Titel

Tools with Artificial Intelligence (ICTAI), 2012 IEEE 24th International Conference on

Conference_Location

Athens

ISSN

1082-3409

Print_ISBN

978-1-4799-0227-9

Type

conf

DOI

10.1109/ICTAI.2012.112

Filename

6495124