Title :
On Some Feature Selection Strategies for Spam Filter Design
Author :
Wang, Ruiqi ; Youssef, Amr M. ; Elhakeem, Ahmed K.
Author_Institution :
Dept. of Electr. & Comput. Eng., Concordia Univ., Montreal, Que.
Abstract :
Feature selection is an important research problem in many statistical learning tasks, including text categorization applications such as spam email classification. In designing spam filters, an email is often represented using the vector space model (VSM), i.e., every email is treated as a vector of word terms. Since an email corpus contains many different terms, and not all classifiers can handle such high dimensionality, only the most discriminative terms should be used. In addition, some features may not be influential and may carry redundant information that can confuse the classifier. Thus, feature selection, and hence dimensionality reduction, is a crucial step in getting the best out of the constructed features. Many feature selection strategies can be applied to produce the final feature set. In this paper, we investigate the use of hill climbing, simulated annealing, and threshold accepting optimization techniques as feature selection algorithms. We also compare the performance of these three techniques with linear discriminant analysis. Our experimental results show that all of these techniques can be used not only to reduce the dimensionality of the email representation, but also to improve the performance of the classification filter. Among all the strategies, simulated annealing performs best, reaching a classification accuracy of 95.5%.
Keywords :
electronic mail; filtering theory; learning (artificial intelligence); signal classification; simulated annealing; feature selection strategy; hill climbing; linear discriminate analysis; simulated annealing; spam email classification; spam filter design; statistical learning; threshold accepting optimization technique; vector space model; Costs; Design engineering; Information filtering; Information filters; Information systems; Performance analysis; Simulated annealing; Systems engineering and theory; Text categorization; Unsolicited electronic mail; Hill Climbing; Linear Discriminate Analysis; Simulated Annealing; Spam email filter; Threshold Accepting; feature selection;
Conference_Title :
Electrical and Computer Engineering, 2006. CCECE '06. Canadian Conference on
Conference_Location :
Ottawa, Ont.
Print_ISBN :
1-4244-0038-4
Electronic_ISBN :
1-4244-0038-4
DOI :
10.1109/CCECE.2006.277770