DocumentCode :
57027
Title :
Sceadan: Using Concatenated N-Gram Vectors for Improved File and Data Type Classification
Author :
Beebe, Nicole L. ; Maddox, Laurence A. ; Lishu Liu ; Minghe Sun
Author_Institution :
Inf. Syst. & Cyber Security Dept., Univ. of Texas at San Antonio, San Antonio, TX, USA
Volume :
8
Issue :
9
fYear :
2013
fDate :
Sept. 2013
Firstpage :
1519
Lastpage :
1530
Abstract :
Over 20 studies have been published in the past decade involving file and data type classification for digital forensics and information security applications. Methods using n-grams as inputs have proven the most successful across a wide variety of types; however, there are mixed results regarding the utility of unigrams and bigrams as inputs independently. In this study, we use support vector machines (SVMs) with unigrams and bigrams, as well as complexity and other byte frequency-based measures, as inputs. Using concatenated unigrams and bigrams as input and a linear kernel SVM, we achieve significantly improved results over those previously reported (73.4% classification rate across 38 file and data types). We are the first to use concatenated n-grams as the sole input, and we show their superiority over inputs used previously. We also found that using too many different types of features as inputs results in overfitting and poor generalization. We include several types seldom or not studied in the past (Microsoft Office 2010 files, file system data, base64, base85, URL encoding, flash video, M4A, MP4, WMV, and JSON records). The “winning” approach is instantiated in an open-source software tool called Sceadan (Systematic Classification Engine for Advanced Data ANalysis).
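As an illustration of the concatenated n-gram input described in the abstract, the following minimal sketch builds a single feature vector from a block of bytes by joining a 256-element unigram (byte) frequency vector with a 65,536-element bigram (byte pair) frequency vector. The function names, the use of overlapping bigrams, and the normalization to relative frequencies are assumptions made for this example; the abstract does not specify these details, and this is not the Sceadan implementation. Training a linear kernel SVM on such vectors is not shown.

```python
from collections import Counter


def unigram_vector(block: bytes) -> list[float]:
    """Relative frequency of each of the 256 possible byte values."""
    n = len(block)
    if n == 0:
        return [0.0] * 256
    counts = Counter(block)  # iterating over bytes yields ints 0..255
    return [counts.get(value, 0) / n for value in range(256)]


def bigram_vector(block: bytes) -> list[float]:
    """Relative frequency of each of the 65,536 possible byte pairs.

    Assumption: bigrams overlap (a sliding window with stride 1).
    """
    n = len(block) - 1  # number of overlapping bigrams
    counts = [0] * 65536
    if n <= 0:
        return [0.0] * 65536
    for first, second in zip(block, block[1:]):
        counts[(first << 8) | second] += 1
    return [c / n for c in counts]


def concatenated_ngram_vector(block: bytes) -> list[float]:
    """Unigram features followed by bigram features (256 + 65,536 values)."""
    return unigram_vector(block) + bigram_vector(block)


if __name__ == "__main__":
    features = concatenated_ngram_vector(b"abab")
    print(len(features))
```

Each block of data yields one such vector, and a classifier would be trained on the vectors together with their known file or data type labels.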
Keywords :
computational complexity; data analysis; digital forensics; file organisation; pattern classification; public domain software; software tools; support vector machines; Sceadan-systematic classification engine; advanced data analysis; byte frequency-based measure; complexity; concatenated N-gram vector; concatenated bigram; concatenated unigram; data type classification; digital forensics; file type classification; information security application; linear kernel SVM; open source software tool; support vector machine; winning approach; Classification algorithms; Complexity theory; Frequency measurement; Kernel; Support vector machine classification; Training; Data type classification; digital forensics; file type classification; n-gram; support vector machine;
fLanguage :
English
Journal_Title :
Information Forensics and Security, IEEE Transactions on
Publisher :
IEEE
ISSN :
1556-6013
Type :
jour
DOI :
10.1109/TIFS.2013.2274728
Filename :
6567922