Title :
Using Web Search Results and Genetic Algorithm to Improve the Accuracy of Chinese Spam Email Filters
Author :
Lu, Kai-Shin ; Chang, Carl K.
Author_Institution :
Dept. of Comput. Sci., Iowa State Univ., Ames, IA, USA
Abstract :
In recent years, much research has focused on developing effective spam email filters because spam has become a serious problem. Among existing solutions, studies have shown the Naive Bayesian spam email filter to be the best because it achieves the highest accuracy in filtering out English spam emails. However, filtering out Chinese spam emails remains an open problem because Chinese sentences are difficult to segment correctly. This paper presents a Web-Search-Results (WSR) based Genetic Algorithm (GA) Chinese sentence tokenizer that automatically segments Chinese sentences. A fuzzy-splitting algorithm that helps the GA handle longer sentences is also proposed. We also describe the implementation of this tokenizer alongside a standard Naive Bayesian email filter, and then introduce the training and evaluation process. Evaluations on a real-world spam email dataset, "CCERT Data Sets of Chinese Emails" (CDSCE), show that our approach effectively improves the accuracy of identifying Chinese spam emails.
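To illustrate why segmentation quality matters to the filter, the following is a minimal sketch of a multinomial Naive Bayesian classifier that consumes already-tokenized messages. It is not the authors' implementation: the class name `NaiveBayesFilter`, the Laplace smoothing parameter `alpha`, and the toy training messages are all assumptions made for illustration, and the tokenizer (the paper's actual contribution) is assumed to have run beforehand.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Toy multinomial Naive Bayes over pre-segmented tokens (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant (assumed value)
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, tokens, label):
        # Accumulate per-class token counts and per-class message counts.
        self.counts[label].update(tokens)
        self.docs[label] += 1

    def _log_score(self, tokens, label):
        # Requires at least one training message for each class.
        vocab = set(self.counts["spam"]) | set(self.counts["ham"])
        total = sum(self.counts[label].values())
        score = math.log(self.docs[label] / sum(self.docs.values()))
        for t in tokens:
            num = self.counts[label][t] + self.alpha
            den = total + self.alpha * len(vocab)
            score += math.log(num / den)
        return score

    def classify(self, tokens):
        # Pick the class with the higher log posterior (up to a constant).
        return max(("spam", "ham"), key=lambda lbl: self._log_score(tokens, lbl))


# Hypothetical, hand-segmented training messages.
nb = NaiveBayesFilter()
nb.train(["免费", "发票"], "spam")
nb.train(["免费", "优惠"], "spam")
nb.train(["会议", "通知"], "ham")
nb.train(["项目", "会议"], "ham")
print(nb.classify(["免费", "发票"]))
```

Because the classifier only sees tokens, a wrong split of a sentence produces tokens that never match the trained vocabulary, which is why the segmentation step determines how much evidence the filter can use.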
Keywords :
Bayes methods; Web services; e-mail filters; fuzzy set theory; genetic algorithms; unsolicited e-mail; Chinese sentence segmentation; Chinese sentence tokenizer; Chinese spam email filters; English spam emails; Naive Bayesian spam email filter; Web search result based genetic algorithm; fuzzy-splitting algorithm; Accuracy; Bayesian methods; Biological cells; Databases; Electronic mail; Genetic algorithms; Web search; Email Filtering; Genetic Algorithm; Sentence Segment; Web Services;
Conference_Title :
Computer Software and Applications Conference Workshops (COMPSACW), 2011 IEEE 35th Annual
Conference_Location :
Munich
Print_ISBN :
978-1-4577-0980-7
Electronic_ISBN :
978-0-7695-4459-5
DOI :
10.1109/COMPSACW.2011.56