• DocumentCode
    3137470
  • Title
    On Some Feature Selection Strategies for Spam Filter Design
  • Author
    Wang, Ruiqi ; Youssef, Amr M. ; Elhakeem, Ahmed K.
  • Author_Institution
    Dept. of Electr. & Comput. Eng., Concordia Univ., Montreal, Que.
  • fYear
    2006
  • fDate
    May 2006
  • Firstpage
    2186
  • Lastpage
    2189
  • Abstract
    Feature selection is an important research problem in many statistical learning tasks, including text categorization applications such as spam email classification. In designing spam filters, the email is often represented using the vector space model (VSM), i.e., every email is considered as a vector of word terms. Since an email contains many different terms, and not all classifiers can handle such a high dimension, only the most discriminative terms should be used. In addition, some of these features may not be influential and might carry redundant information that can confuse the classifier. Thus, feature selection, and hence dimensionality reduction, is a crucial step in getting the best out of the constructed features. There are many feature selection strategies that can be applied to produce the resulting feature set. In this paper, we investigate the use of hill climbing, simulated annealing, and threshold accepting optimization techniques as feature selection algorithms. We also compare the performance of these three techniques with linear discriminant analysis. Our experimental results show that all these techniques can be used not only to reduce the dimensionality of the email representation, but also to improve the performance of the classification filter. Among all the strategies, simulated annealing performs best, reaching a classification accuracy of 95.5%.
  • Keywords
    electronic mail; filtering theory; learning (artificial intelligence); signal classification; simulated annealing; feature selection strategy; hill climbing; linear discriminate analysis; simulated annealing; spam email classification; spam filter design; statistical learning; threshold accepting optimization technique; vector space model; Costs; Design engineering; Information filtering; Information filters; Information systems; Performance analysis; Simulated annealing; Systems engineering and theory; Text categorization; Unsolicited electronic mail; Hill Climbing; Linear Discriminate Analysis; Simulated Annealing; Spam email filter; Threshold Accepting; feature selection;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Electrical and Computer Engineering, 2006. CCECE '06. Canadian Conference on
  • Conference_Location
    Ottawa, Ont.
  • Print_ISBN
    1-4244-0038-4
  • Electronic_ISBN
    1-4244-0038-4
  • Type
    conf
  • DOI
    10.1109/CCECE.2006.277770
  • Filename
    4054718