Author_Institution :
Inf. & Sci. Coll., Beijing Language & Culture Univ., Beijing, China
Abstract :
Feature selection plays an important role in text categorization. Automatic feature selection methods such as document frequency thresholding (DF), information gain (IG), and mutual information (MI) are commonly applied in text categorization, and existing experiments show that IG is one of the most effective. In this paper, we propose a feature selection method based on rough set theory. According to rough set theory, knowledge about a universe of objects may be defined as classifications based on certain properties of the objects; that is, rough set theory assumes that knowledge is an ability to partition objects. We quantify this ability to classify objects and call the resulting amount knowledge quantity. Building on this quantification, we put forward the notion of "knowledge gain" and propose a knowledge gain feature selection method (KG method). The task of spam filtering can be seen as a special case of text classification, so an effective and efficient feature selection method that can easily select the major features is important for spam filtering. We explore two classifiers (Naive Bayes and SVM), and our experiments on a Chinese spam collection show that KG performs better than IG, especially under extremely aggressive feature reduction. We conclude that the KG feature selection method achieves state-of-the-art performance for spam filtering, especially for Chinese spam emails.
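For orientation, the IG baseline that the abstract compares against can be sketched as follows. This is a minimal illustration of standard information gain over binary term-presence features, not the KG method itself, whose formula the abstract does not give. The function names, the set-of-tokens document representation, and the toy data are all assumptions made for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(docs, labels, term):
    """IG of the binary feature 'term present / absent' with respect to the labels.

    docs is a list of token sets (an assumed representation), labels the class of each doc.
    """
    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    # Expected entropy of the class after observing whether the term occurs.
    conditional = (len(present) / n) * entropy(present) + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional


def select_top_terms(docs, labels, k):
    """Keep the k terms with the highest IG; ties are broken alphabetically."""
    vocab = set().union(*docs)
    ranked = sorted(vocab, key=lambda t: (-information_gain(docs, labels, t), t))
    return ranked[:k]


# Hypothetical toy corpus: each message is already tokenized into a set of terms.
docs = [{"free", "prize"}, {"free", "offer"}, {"meeting", "report"}, {"report", "offer"}]
labels = ["spam", "spam", "ham", "ham"]
top_terms = select_top_terms(docs, labels, 2)
```

On this toy corpus, "free" and "report" each separate the two classes perfectly, while "offer" occurs equally often in both and carries no information. "Extremely aggressive reduction" in the abstract corresponds to choosing a very small k relative to the vocabulary size.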
Keywords :
belief networks; information filtering; pattern classification; rough set theory; support vector machines; text analysis; unsolicited e-mail; Chinese spam collection; Chinese spam emails; Chinese spam filtering; Naive Bayes classifier; SVM classifier; antispam filtering; automatic feature selection method; knowledge gain feature selection method; knowledge quantity; rough set theory; text categorization; Accuracy; Set theory; Support vector machines; Text categorization; Training; Unsolicited electronic mail; artificial intelligence; clustering; feature selection; rough set; spam filtering;