DocumentCode
3137470
Title
On Some Feature Selection Strategies for Spam Filter Design
Author
Wang, Ruiqi ; Youssef, Amr M. ; Elhakeem, Ahmed K.
Author_Institution
Dept. of Electr. & Comput. Eng., Concordia Univ., Montreal, Que.
fYear
2006
fDate
May 2006
Firstpage
2186
Lastpage
2189
Abstract
Feature selection is an important research problem in many statistical learning tasks, including text categorization applications such as spam email classification. In designing spam filters, an email is often represented by the vector space model (VSM), i.e., every email is treated as a vector of word terms. Because emails contain many different terms and not all classifiers can handle such high dimensionality, only the most discriminative terms should be used. Moreover, some features may have little influence or carry redundant information, which may confuse the classifier. Thus, feature selection, and hence dimensionality reduction, is a crucial step in getting the best out of the constructed features. Many feature selection strategies can be applied to produce the resulting feature set. In this paper, we investigate the use of hill climbing, simulated annealing, and threshold accepting optimization techniques as feature selection algorithms. We also compare the performance of these three techniques with linear discriminant analysis. Our experimental results show that all these techniques can be used not only to reduce the dimensionality of the email representation but also to improve the performance of the classification filter. Among all the strategies, simulated annealing performs best, reaching a classification accuracy of 95.5%.
Keywords
electronic mail; filtering theory; learning (artificial intelligence); signal classification; simulated annealing; feature selection strategy; hill climbing; linear discriminate analysis; simulated annealing; spam email classification; spam filter design; statistical learning; threshold accepting optimization technique; vector space model; Costs; Design engineering; Information filtering; Information filters; Information systems; Performance analysis; Simulated annealing; Systems engineering and theory; Text categorization; Unsolicited electronic mail; Hill Climbing; Linear Discriminate Analysis; Simulated Annealing; Spam email filter; Threshold Accepting; feature selection;
fLanguage
English
Publisher
IEEE
Conference_Titel
Electrical and Computer Engineering, 2006. CCECE '06. Canadian Conference on
Conference_Location
Ottawa, Ont.
Print_ISBN
1-4244-0038-4
Electronic_ISBN
1-4244-0038-4
Type
conf
DOI
10.1109/CCECE.2006.277770
Filename
4054718