Title :
Feature construction approach for email categorization based on term space partition
Author :
Guyue Mi ; Pengtao Zhang ; Ying Tan
Author_Institution :
Dept. of Machine Intell., Peking Univ., Beijing, China
Abstract :
This paper proposes a novel feature construction approach based on term space partition (TSP), which aims to establish a mechanism that lets terms play fuller and more rational roles in email categorization. Dominant terms and general terms are separated by a vertical partition of the original term space with respect to feature selection metrics, while spam terms and ham terms are separated by a transverse partition with respect to class tendency. Strategies for constructing discriminative features, named term ratio and term density, are designed on the corresponding subspaces. The motivation and principle of the TSP approach are presented in detail, together with its implementation. Experiments are conducted on five benchmark corpora using cross-validation to evaluate the proposed approach. The experimental results suggest that the TSP approach far outperforms bag-of-words, the traditional and most widely used feature construction approach in spam filtering, in both performance and efficiency. Compared with the heuristic and state-of-the-art approaches CFC and LC, the TSP approach shows a clear advantage in accuracy and F1 measure, as well as high precision, which is highly desirable in real-world spam filtering. Furthermore, the TSP approach performs similarly to CFC in the efficiency of processing incoming emails and is much faster than LC. In addition, the TSP approach is shown to work well with both unsupervised and supervised feature selection metrics, which makes it flexible in real-world use.
Keywords :
e-mail filters; feature selection; pattern classification; unsolicited e-mail; CFC; LC; TSP; bag-of-words; concentration based feature construction; discriminative features; email categorization; feature selection metrics; ham terms; local-concentration; spam filtering; spam terms; supervised feature selection metrics; term density; term ratio; term space partition; transverse partition; unsupervised feature selection metrics; Accuracy; Feature extraction; Noise measurement; Unsolicited electronic mail; Vectors;
Conference_Title :
Neural Networks (IJCNN), The 2013 International Joint Conference on
Conference_Location :
Dallas, TX
Print_ISBN :
978-1-4673-6128-6
DOI :
10.1109/IJCNN.2013.6707020