Regional Information Center for Science and Technology (RICEST) - Classifying Spam Emails Using Text and Readability Features

DocumentCode :

679534

Title :

Classifying Spam Emails Using Text and Readability Features

Author :

Shams, Reza ; Mercer, Robert E.

Author_Institution :

Dept. of Comput. Sci., Univ. of Western Ontario, London, ON, Canada

fYear :

2013

fDate :

7-10 Dec. 2013

Firstpage :

657

Lastpage :

666

Abstract :

Supervised machine learning methods for classifying spam emails are long-established. Most of these methods use either header-based or content-based features. Spammers, however, can bypass these methods easily, especially the ones that rely on header features. In this paper, we report a novel spam classification method that uses features based on email content language and readability, combined with the previously used content-based task features. The features are extracted from four benchmark datasets, viz. CSDMC2010, Spam Assassin, Ling Spam, and Enron-Spam. We use five well-known algorithms to induce our spam classifiers: Random Forest (RF), BAGGING, ADABOOSTM1, Support Vector Machine (SVM), and Naïve Bayes (NB). We evaluate the classifiers' performance and find that BAGGING performs best. Moreover, its performance surpasses that of a number of state-of-the-art methods proposed in previous studies. Although the method was applied only to English-language emails, the results indicate that it may be an excellent means of classifying spam emails in other languages as well.
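The abstract does not list the individual readability features, so the following is only a minimal sketch of what one such feature could look like. It assumes the Flesch Reading Ease score as a stand-in readability measure and uses a crude vowel-group heuristic for syllable counting; the function names `count_syllables` and `readability_features` are hypothetical and are not taken from the paper.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): one syllable per run of vowels, minimum one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def readability_features(text: str) -> dict:
    """Return simple readability statistics for one email body.

    Flesch Reading Ease is used here purely for illustration:
    206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
    """
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)

    if not sentences or not words:
        return {
            "words": 0,
            "sentences": 0,
            "avg_sentence_len": 0.0,
            "avg_syllables_per_word": 0.0,
            "flesch_reading_ease": 0.0,
        }

    syllables = sum(count_syllables(w) for w in words)
    avg_sentence_len = len(words) / len(sentences)
    avg_syllables_per_word = syllables / len(words)
    flesch = 206.835 - 1.015 * avg_sentence_len - 84.6 * avg_syllables_per_word

    return {
        "words": len(words),
        "sentences": len(sentences),
        "avg_sentence_len": avg_sentence_len,
        "avg_syllables_per_word": avg_syllables_per_word,
        "flesch_reading_ease": flesch,
    }


if __name__ == "__main__":
    print(readability_features("Buy now. Win cash today!"))
```

In a pipeline like the one the abstract describes, a feature dictionary of this kind would be computed per email and concatenated with the content-based features before training a classifier such as bagging; that wiring is not shown here.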

Keywords :

Bayes methods; feature extraction; learning (artificial intelligence); pattern classification; support vector machines; text analysis; unsolicited e-mail; AdaBoostM1; CSDMC2010; English language emails; Enron-spam; Ling spam; NB; Naïve Bayes; RF; SVM; bagging; benchmark datasets; content-based task features; email content-language; feature extraction; header-based features; random forest; readability features; spam assassin; spam classification method; spam classifiers; spam email classification; spammers; supervised machine learning methods; support vector machine; text features; Bagging; HTML; Indexes; Radio frequency; Support vector machines; Unsolicited electronic mail; Spam classification; anti-spam filter; feature importance; machine-learning application; performance evaluation; readability features; text categorization; text features

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Data Mining (ICDM), 2013 IEEE 13th International Conference on

Conference_Location :

Dallas, TX

ISSN :

1550-4786

Type :

conf

DOI :

10.1109/ICDM.2013.131

Filename :

6729550

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=679534