DocumentCode
1905467
Title
Author Identification in Imbalanced Sets of Source Code Samples
Author
Chatzicharalampous, E. ; Frantzeskou, G. ; Stamatatos, E.
Author_Institution
Dept. of Inf. & Commun. Syst. Eng., Univ. of the Aegean, Karlovassi, Greece
Volume
1
fYear
2012
fDate
7-9 Nov. 2012
Firstpage
790
Lastpage
797
Abstract
Similarly to natural language texts, source code documents can be distinguished by their style. Source code author identification can be viewed as a text classification task, given that samples of known authorship by a set of candidate authors are available. Although very promising results have been reported for this task, the evaluation of existing approaches has not focused on the class imbalance problem and its effect on performance. In this paper, we present a systematic experimental study of author identification in skewed training sets, where the training samples are unequally distributed over the candidate authors. Two representative author identification methods are examined: one follows the profile-based paradigm (where a single representation is produced from all the available training samples per author) and the other follows the instance-based paradigm (where each training sample has its own individual representation). We examine the effect of the source code representation on the performance of these methods and show that the profile-based method is better able to handle highly skewed training sets, while the instance-based method is a better choice for balanced or slightly skewed training sets.
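The difference between the two paradigms named in the abstract can be illustrated with a minimal Python sketch. It is not the implementation evaluated in the paper: the function names, the n-gram length `n`, the profile size `size`, the shared-n-gram similarity, and the nearest-neighbour stand-in for a trained classifier are all assumptions made for illustration. Only the use of byte-level n-grams and the profile-based versus instance-based distinction come from the record itself.

```python
from collections import Counter


def byte_ngrams(code: str, n: int = 3) -> Counter:
    """Count overlapping byte-level n-grams in one source code sample."""
    data = code.encode("utf-8")
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def top_ngrams(counts: Counter, size: int) -> set:
    """Keep the `size` most frequent n-grams as a set (a simplified representation)."""
    return {gram for gram, _ in counts.most_common(size)}


def profile_based_predict(train, sample, n=3, size=500):
    """Profile-based: all samples of an author are merged into one representation."""
    merged = {}
    for author, code in train:
        merged.setdefault(author, Counter()).update(byte_ngrams(code, n))
    target = top_ngrams(byte_ngrams(sample, n), size)
    # Assumed similarity: number of n-grams shared by the two sets.
    return max(merged, key=lambda a: len(top_ngrams(merged[a], size) & target))


def instance_based_predict(train, sample, n=3, size=500):
    """Instance-based: every training sample keeps its own representation."""
    target = top_ngrams(byte_ngrams(sample, n), size)
    # Assumed stand-in for a trained classifier: pick the author of the
    # single most similar training sample (1-nearest neighbour).
    best_author, _ = max(
        train,
        key=lambda pair: len(top_ngrams(byte_ngrams(pair[1], n), size) & target),
    )
    return best_author
```

Both functions take `train` as a list of `(author, code)` pairs, so a skewed training set is simply one in which some authors appear far more often than others. In the profile-based variant an author with few samples still yields a single merged profile, whereas in the instance-based variant that author contributes only a few individual training instances.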
Keywords
natural language processing; pattern classification; source coding; text analysis; class imbalance problem; imbalanced source code sample sets; instance-based paradigm; natural language texts; profile-based paradigm; skewed training sets; source code author identification; source code documents; text classification task; Forensics; Measurement; Natural languages; Software; Support vector machines; Text categorization; Training; Source code author identification; byte-level n-grams; class imbalance
fLanguage
English
Publisher
IEEE
Conference_Titel
Tools with Artificial Intelligence (ICTAI), 2012 IEEE 24th International Conference on
Conference_Location
Athens
ISSN
1082-3409
Print_ISBN
978-1-4799-0227-9
Type
conf
DOI
10.1109/ICTAI.2012.112
Filename
6495124