Title :
Filtering Spam Using Kolmogorov Complexity Estimates
Author :
Spracklin, L.M. ; Saxton, L.V.
Author_Institution :
Ottawa Univ., Ottawa, ON
Abstract :
This paper introduces an adaptive filter which filters spam email based on Kolmogorov complexity estimates. The complexity filter is first trained exactly like a Bayesian filter. Each email is mapped to a string representation in which the tokens or words are represented by either 0 or 1. Tokens associated with spam are represented by 1 whereas those associated with non-spam, or ham, are represented by 0. Common tokens are ignored. The Kolmogorov complexity of this string representation is estimated using run-length compression. If the resulting Kolmogorov complexity is low then the email is classified as spam. Otherwise the email is classified as ham. The complexity filter can filter messages almost twice as fast as a comparable Bayesian filter and achieve accuracy rates of 80% to 96% While a Bayesian filter views an email as a "bag of words", the complexity filter uses token distribution information and is likely less vulnerable to statistical attack.
Keywords :
adaptive filters; computational complexity; data compression; information filtering; learning (artificial intelligence); pattern classification; unsolicited e-mail; Bayesian filter; Kolmogorov complexity estimation; adaptive filter; e-mail classification; run-length compression; spam email filtering; string representation; Adaptive filters; Bayesian methods; Frequency; HTML; Humans; Information filtering; Information filters; Speech; Text categorization; Unsolicited electronic mail;
Conference_Titel :
Advanced Information Networking and Applications Workshops, 2007, AINAW '07. 21st International Conference on
Conference_Location :
Niagara Falls, Ont.
Print_ISBN :
978-0-7695-2847-2
DOI :
10.1109/AINAW.2007.184