Title :
A Local-Concentration-Based Feature Extraction Approach for Spam Filtering
Author :
Zhu, Yuanchun ; Tan, Ying
Author_Institution :
Dept. of Machine, Peking Univ., Beijing, China
fDate :
6/1/2011
Abstract :
Inspired by the biological immune system, we propose a local concentration (LC)-based feature extraction approach for spam filtering. The LC approach extracts position-correlated information from messages by transforming each area of a message into a corresponding LC feature. Two implementation strategies of the LC approach are designed, one using a fixed-length sliding window and one using a variable-length sliding window. To incorporate the LC approach into the whole process of spam filtering, a generic LC model is designed. In the LC model, two types of detector sets are first generated by using term selection methods and a well-defined tendency threshold. Then a sliding window is adopted to divide the message into individual areas. After segmentation of the message, the concentration of detectors is calculated and taken as the feature for each local area. Finally, the features of all local areas are combined into a feature vector of the message. To evaluate the proposed LC model, several experiments are conducted on five benchmark corpora using the cross-validation method. It is shown that the LC approach cooperates well with three term selection methods, which makes it flexibly applicable in practice. Compared to the global-concentration-based approach and the prevalent bag-of-words approach, the LC approach has better performance in terms of both accuracy and F1 measure. It is also demonstrated that the LC approach is robust against variation in message length.
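The pipeline described in the abstract (detector sets, sliding window, per-area concentration, combined feature vector) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the tendency formula (difference of per-class document frequencies), the concentration formula (share of distinct window terms matched by a detector set), and the equal-split windowing used to mimic the variable-length strategy are all assumptions made for illustration. The term selection step is omitted.

```python
from collections import Counter


def build_detector_sets(spam_docs, ham_docs, threshold):
    """Split terms into spam and legitimate detector sets.

    Assumption: a term's tendency is its document frequency in spam
    minus its document frequency in legitimate mail.
    """
    spam_df = Counter(t for doc in spam_docs for t in set(doc))
    ham_df = Counter(t for doc in ham_docs for t in set(doc))
    spam_detectors, ham_detectors = set(), set()
    for term in set(spam_df) | set(ham_df):
        tendency = spam_df[term] / len(spam_docs) - ham_df[term] / len(ham_docs)
        if tendency > threshold:
            spam_detectors.add(term)
        elif tendency < -threshold:
            ham_detectors.add(term)
    return spam_detectors, ham_detectors


def concentration(window, detectors):
    """Assumed definition: fraction of distinct window terms that are detectors."""
    terms = set(window)
    return len(terms & detectors) / len(terms) if terms else 0.0


def lc_features(tokens, spam_detectors, ham_detectors, n_windows=3):
    """Cut the message into n_windows areas and emit two concentrations per area.

    The window length scales with the message, so every message yields
    a vector of the same dimension (2 * n_windows).
    """
    features = []
    for i in range(n_windows):
        start = i * len(tokens) // n_windows
        end = (i + 1) * len(tokens) // n_windows
        window = tokens[start:end]
        features.append(concentration(window, spam_detectors))
        features.append(concentration(window, ham_detectors))
    return features


# Toy data, invented purely for illustration.
spam_docs = [["free", "money", "now"], ["free", "offer", "now"]]
ham_docs = [["meeting", "tomorrow", "now"], ["project", "meeting", "notes"]]
spam_set, ham_set = build_detector_sets(spam_docs, ham_docs, threshold=0.6)
message = ["free", "money", "free", "offer", "meeting", "notes"]
print(lc_features(message, spam_set, ham_set))
# prints [0.5, 0.0, 0.5, 0.0, 0.0, 0.5]
```

In this toy run the spam concentration is high in the first two areas and the legitimate concentration is high only in the last one, which is the kind of position-correlated signal a single global concentration would average away.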
Keywords :
artificial immune systems; feature extraction; unsolicited e-mail; LC model; biological immune system; cross-validation method; feature vector; fixed-length sliding window; local-concentration-based feature extraction approach; position-correlated information extraction; spam filtering; tendency threshold; term selection methods; variable-length sliding window; Accuracy; Artificial neural networks; Electronic mail; Feature extraction; Immune system; Productivity; Training; Artificial immune system (AIS); bag-of-words (BoW); feature extraction; global concentration (GC); local concentration (LC); spam filtering;
Journal_Title :
Information Forensics and Security, IEEE Transactions on
DOI :
10.1109/TIFS.2010.2103060