• DocumentCode
    464190
  • Title

    Filtering Spam Using Kolmogorov Complexity Estimates

  • Author

    Spracklin, L.M. ; Saxton, L.V.

  • Author_Institution
    Ottawa Univ., Ottawa, ON
  • Volume
    1
  • fYear
    2007
  • fDate
    21-23 May 2007
  • Firstpage
    321
  • Lastpage
    328
  • Abstract
    This paper introduces an adaptive filter which filters spam email based on Kolmogorov complexity estimates. The complexity filter is first trained exactly like a Bayesian filter. Each email is mapped to a string representation in which the tokens or words are represented by either 0 or 1. Tokens associated with spam are represented by 1 whereas those associated with non-spam, or ham, are represented by 0. Common tokens are ignored. The Kolmogorov complexity of this string representation is estimated using run-length compression. If the resulting Kolmogorov complexity is low then the email is classified as spam. Otherwise the email is classified as ham. The complexity filter can filter messages almost twice as fast as a comparable Bayesian filter and achieve accuracy rates of 80% to 96% While a Bayesian filter views an email as a "bag of words", the complexity filter uses token distribution information and is likely less vulnerable to statistical attack.
  • Keywords
    adaptive filters; computational complexity; data compression; information filtering; learning (artificial intelligence); pattern classification; unsolicited e-mail; Bayesian filter; Kolmogorov complexity estimation; adaptive filter; e-mail classification; run-length compression; spam email filtering; string representation; Adaptive filters; Bayesian methods; Frequency; HTML; Humans; Information filtering; Information filters; Speech; Text categorization; Unsolicited electronic mail;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Advanced Information Networking and Applications Workshops, 2007, AINAW '07. 21st International Conference on
  • Conference_Location
    Niagara Falls, Ont.
  • Print_ISBN
    978-0-7695-2847-2
  • Type

    conf

  • DOI
    10.1109/AINAW.2007.184
  • Filename
    4221080