DocumentCode :
1905467
Title :
Author Identification in Imbalanced Sets of Source Code Samples
Author :
Chatzicharalampous, E. ; Frantzeskou, G. ; Stamatatos, E.
Author_Institution :
Dept. of Inf. & Commun. Syst. Eng., Univ. of the Aegean, Karlovassi, Greece
Volume :
1
fYear :
2012
fDate :
7-9 Nov. 2012
Firstpage :
790
Lastpage :
797
Abstract :
Like natural language texts, source code documents can be distinguished by their style. Source code author identification can be viewed as a text classification task, given that samples of known authorship are available for a set of candidate authors. Although very promising results have been reported for this task, the evaluation of existing approaches has not focused on the class imbalance problem and its effect on performance. In this paper, we present a systematic experimental study of author identification in skewed training sets, where the training samples are unequally distributed over the candidate authors. Two representative author identification methods are examined: one follows the profile-based paradigm (where a single representation is produced from all the available training samples per author) and the other follows the instance-based paradigm (where each training sample has its own individual representation). We examine the effect of the source code representation on the performance of these methods and show that the profile-based method is better able to handle highly skewed training sets, while the instance-based method is a better choice for balanced or slightly skewed training sets.
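The difference between the two paradigms named in the abstract can be illustrated with a minimal sketch over byte-level n-grams. This is not the authors' implementation: the function names, the n-gram length `n=3`, the `profile_size` cutoff, the set-overlap score, and the cosine nearest-neighbour rule are all assumptions chosen for illustration (the nearest-neighbour rule in particular is a simple stand-in for whatever instance-based classifier the paper uses).

```python
from collections import Counter
from math import sqrt


def byte_ngrams(code: str, n: int = 3) -> Counter:
    """Count overlapping byte-level n-grams in one source code sample."""
    data = code.encode("utf-8")
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def profile_based_predict(train, sample, n=3, profile_size=500):
    """Profile-based sketch: pool all samples of an author into ONE profile
    (the set of its most frequent n-grams) and pick the author whose profile
    shares the most n-grams with the query. Scoring rule is an assumption."""
    query = {g for g, _ in byte_ngrams(sample, n).most_common(profile_size)}
    best_author, best_overlap = None, -1
    for author, samples in train.items():
        pooled = Counter()
        for s in samples:
            pooled.update(byte_ngrams(s, n))
        profile = {g for g, _ in pooled.most_common(profile_size)}
        overlap = len(profile & query)
        if overlap > best_overlap:
            best_author, best_overlap = author, overlap
    return best_author


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def instance_based_predict(train, sample, n=3):
    """Instance-based sketch: every training sample keeps its OWN vector;
    the query takes the author of its most similar sample (1-nearest
    neighbour, used here as a hypothetical stand-in classifier)."""
    query = byte_ngrams(sample, n)
    best_author, best_sim = None, -1.0
    for author, samples in train.items():
        for s in samples:
            sim = cosine(byte_ngrams(s, n), query)
            if sim > best_sim:
                best_author, best_sim = author, sim
    return best_author


# Toy, deliberately imbalanced training set (made-up snippets).
train = {
    "alice": [
        "for (int i = 0; i < n; i++) { total += a[i]; }",
        "for (int j = 0; j < m; j++) { sum += b[j]; }",
    ],
    "bob": ["while(k<n){\n\ttotal+=a[k];k++;}"],
}
query = "for (int k = 0; k < n; k++) { acc += c[k]; }"
print(profile_based_predict(train, query), instance_based_predict(train, query))
```

In the profile-based function an author with many samples and an author with one sample each end up with a single profile, whereas the instance-based function compares the query against every training sample separately, so the number of samples per author directly changes how many comparisons each author contributes.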
Keywords :
natural language processing; pattern classification; source coding; text analysis; class imbalance problem; imbalanced source code sample sets; instance-based paradigm; natural language texts; profile-based paradigm; skewed training sets; source code author identification; source code documents; text classification task; Forensics; Measurement; Natural languages; Software; Support vector machines; Text categorization; Training; Source code author identification; byte-level n-grams; class imbalance;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Tools with Artificial Intelligence (ICTAI), 2012 IEEE 24th International Conference on
Conference_Location :
Athens
ISSN :
1082-3409
Print_ISBN :
978-1-4799-0227-9
Type :
conf
DOI :
10.1109/ICTAI.2012.112
Filename :
6495124