Title :
Classifying Spam Emails Using Text and Readability Features
Author :
Shams, Reza; Mercer, Robert E.
Author_Institution :
Dept. of Comput. Sci., Univ. of Western Ontario, London, ON, Canada
Abstract :
Supervised machine learning methods for classifying spam emails are long-established. Most of these methods use either header-based or content-based features. Spammers, however, can bypass these methods easily, especially the ones that rely on header features. In this paper, we report a novel spam classification method that uses features based on the language and readability of the email content, combined with the previously used content-based task features. The features are extracted from four benchmark datasets, viz. CSDMC2010, Spam Assassin, Ling Spam, and Enron-Spam. We use five well-known algorithms to induce our spam classifiers: Random Forest (RF), BAGGING, ADABOOSTM1, Support Vector Machine (SVM), and Naïve Bayes (NB). We evaluate the performance of the classifiers and find that BAGGING performs best. Moreover, its performance surpasses that of a number of state-of-the-art methods proposed in previous studies. Although the method was applied only to English-language emails, the results indicate that it may be an excellent means of classifying spam emails in other languages as well.
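The abstract does not list the individual readability features, so the following is only an illustrative sketch of what extracting such features from an email body could look like. It computes the standard Flesch Reading Ease and Flesch-Kincaid Grade Level scores using a rough vowel-group syllable heuristic. The function names, the choice of these two scores, and the syllable heuristic are assumptions made for illustration and are not taken from the paper.

```python
import re

# Hypothetical helpers for illustration; not the authors' feature extractor.
VOWEL_GROUP = re.compile(r"[aeiouy]+")
WORD = re.compile(r"[A-Za-z]+")
SENTENCE_END = re.compile(r"[.!?]+")


def count_syllables(word):
    """Rough heuristic: count vowel groups, discounting a trailing silent 'e'."""
    word = word.lower()
    groups = len(VOWEL_GROUP.findall(word))
    if word.endswith("e") and not word.endswith("le") and groups > 1:
        groups -= 1
    return max(groups, 1)


def readability_features(text):
    """Return a small dict of text and readability features for one email body."""
    words = WORD.findall(text)
    sentences = [s for s in SENTENCE_END.split(text) if s.strip()]
    n_words = max(len(words), 1)          # guard against division by zero
    n_sentences = max(len(sentences), 1)
    n_syllables = sum(count_syllables(w) for w in words)

    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return {
        "word_count": len(words),
        "sentence_count": n_sentences,
        "avg_word_length": sum(len(w) for w in words) / n_words,
        # Standard Flesch Reading Ease formula
        "flesch_reading_ease": 206.835
        - 1.015 * words_per_sentence
        - 84.6 * syllables_per_word,
        # Standard Flesch-Kincaid Grade Level formula
        "flesch_kincaid_grade": 0.39 * words_per_sentence
        + 11.8 * syllables_per_word
        - 15.59,
    }


if __name__ == "__main__":
    print(readability_features("The cat sat. The dog ran."))
```

A feature dictionary of this kind could be computed per email and passed, alongside other content-based features, to any of the classifiers named in the abstract.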
Keywords :
Bayes methods; feature extraction; learning (artificial intelligence); pattern classification; support vector machines; text analysis; unsolicited e-mail; AdaBoostM1; CSDMC2010; English language emails; Enron-spam; Ling spam; NB; Naïve Bayes; RF; SVM; bagging; benchmark datasets; content-based task features; email content-language; header-based features; random forest; readability features; spam assassin; spam classification method; spam classifiers; spam email classification; spammers; supervised machine learning methods; support vector machine; text features; HTML; Indexes; Radio frequency; Unsolicited electronic mail; Spam classification; anti-spam filter; feature importance; machine-learning application; performance evaluation; text categorization
Conference_Title :
2013 IEEE 13th International Conference on Data Mining (ICDM)
Conference_Location :
Dallas, TX
DOI :
10.1109/ICDM.2013.131