DocumentCode
464190
Title
Filtering Spam Using Kolmogorov Complexity Estimates
Author
Spracklin, L.M. ; Saxton, L.V.
Author_Institution
Ottawa Univ., Ottawa, ON
Volume
1
fYear
2007
fDate
21-23 May 2007
Firstpage
321
Lastpage
328
Abstract
This paper introduces an adaptive filter which filters spam email based on Kolmogorov complexity estimates. The complexity filter is first trained exactly like a Bayesian filter. Each email is mapped to a string representation in which the tokens or words are represented by either 0 or 1. Tokens associated with spam are represented by 1 whereas those associated with non-spam, or ham, are represented by 0. Common tokens are ignored. The Kolmogorov complexity of this string representation is estimated using run-length compression. If the resulting Kolmogorov complexity is low then the email is classified as spam. Otherwise the email is classified as ham. The complexity filter can filter messages almost twice as fast as a comparable Bayesian filter and achieve accuracy rates of 80% to 96% While a Bayesian filter views an email as a "bag of words", the complexity filter uses token distribution information and is likely less vulnerable to statistical attack.
Keywords
adaptive filters; computational complexity; data compression; information filtering; learning (artificial intelligence); pattern classification; unsolicited e-mail; Bayesian filter; Kolmogorov complexity estimation; adaptive filter; e-mail classification; run-length compression; spam email filtering; string representation; Adaptive filters; Bayesian methods; Frequency; HTML; Humans; Information filtering; Information filters; Speech; Text categorization; Unsolicited electronic mail;
fLanguage
English
Publisher
ieee
Conference_Titel
Advanced Information Networking and Applications Workshops, 2007, AINAW '07. 21st International Conference on
Conference_Location
Niagara Falls, Ont.
Print_ISBN
978-0-7695-2847-2
Type
conf
DOI
10.1109/AINAW.2007.184
Filename
4221080
Link To Document