Regional Information Center for Science and Technology - Improving Anti-spam Engine with Large Imbalanced Dataset Using Information Retrieval Technology

DocumentCode :

3499751

Title :

Improving Anti-spam Engine with Large Imbalanced Dataset Using Information Retrieval Technology

Author :

Diao, LiLi ; Yang, Chengzhong

Author_Institution :

Trend Micro Inc., Nanjing, China

Volume :

fYear :

2010

fDate :

23-24 Oct. 2010

Firstpage :

271

Lastpage :

275

Abstract :

Anti-spam technology commonly employs machine learning to identify spam emails. Unfortunately, the email samples used to build machine learning models are rarely in an ideal state: spam emails far outnumber normal ones, which may lead to biased models and unsatisfactory prediction performance. In addition, the sheer number of email samples makes the training process prohibitively resource-intensive and makes the samples difficult for human engineers to sort. In this paper, we propose an approach based on information retrieval technology to compress and balance the training data set. The key idea is to shrink and balance the training data set by removing similar data using information retrieval technology. Experiments show that, with this approach, an anti-spam classifier can achieve better performance with a much smaller and more balanced training data set.
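
The abstract does not state which similarity measure or removal rule the authors use, so the following is only a minimal sketch of the general idea of shrinking and balancing a training set by dropping near-duplicate samples. It assumes whitespace tokenization, TF-IDF weighting with cosine similarity, a greedy "keep only if not too similar to anything already kept" rule, and a schedule of thresholds that is tightened on the larger class. The function names (`tfidf_vectors`, `compress`, `compress_and_balance`) and the threshold values are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one L2-normalised TF-IDF dict per document (whitespace tokens)."""
    tokenised = [doc.lower().split() for doc in docs]
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    n = len(docs)
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Smoothed IDF so terms found in every document keep a non-zero weight.
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two L2-normalised sparse vectors."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def compress(docs, threshold):
    """Greedily keep a document only if its similarity to every kept one is below threshold."""
    vectors = tfidf_vectors(docs)
    kept = []
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[j]) < threshold for j in kept):
            kept.append(i)
    return [docs[i] for i in kept]


def compress_and_balance(spam, ham, thresholds=(0.8, 0.6, 0.4, 0.2)):
    """Compress both classes, then tighten the threshold on whichever class is larger.

    This balancing rule is an assumption made for illustration only.
    """
    spam_kept = compress(spam, thresholds[0])
    ham_kept = compress(ham, thresholds[0])
    for t in thresholds[1:]:
        if len(spam_kept) > len(ham_kept):
            spam_kept = compress(spam_kept, t)
        elif len(ham_kept) > len(spam_kept):
            ham_kept = compress(ham_kept, t)
        else:
            break
    return spam_kept, ham_kept
```

The greedy pass compares each sample against every sample already kept, so it is quadratic in the worst case. For a corpus of the size the abstract describes, an inverted index or another retrieval structure would be the natural way to restrict the comparisons to likely matches.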

Keywords :

information retrieval; learning (artificial intelligence); pattern classification; unsolicited e-mail; anti-spam classifier; anti-spam technology; email spam identification; information retrieval technology; machine learning; anti-spam; similarity measure; training set compression

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Web Information Systems and Mining (WISM), 2010 International Conference on

Conference_Location :

Sanya

Print_ISBN :

978-1-4244-8438-6

Type :

conf

DOI :

10.1109/WISM.2010.139

Filename :

5662325

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3499751