• DocumentCode
    1309507
  • Title

    Cosdes: A Collaborative Spam Detection System with a Novel E-Mail Abstraction Scheme

  • Author

    Tseng, Chi-Yao ; Sung, Pin-Chieh ; Chen, Ming-Syan

  • Author_Institution
    Dept. of Electr. Eng., Nat. Taiwan Univ., Taipei, Taiwan
  • Volume
    23
  • Issue
    5
  • fYear
    2011
  • fDate
    5/1/2011 12:00:00 AM
  • Firstpage
    669
  • Lastpage
    682
  • Abstract
    E-mail communication is indispensable nowadays, but the e-mail spam problem continues to grow drastically. In recent years, the notion of collaborative spam filtering with a near-duplicate similarity matching scheme has been widely discussed. The primary idea of the similarity matching scheme for spam detection is to maintain a known spam database, formed by user feedback, to block subsequent near-duplicate spam. To achieve efficient similarity matching and reduce storage utilization, prior work mainly represents each e-mail by a succinct abstraction derived from the e-mail content text. However, these e-mail abstractions cannot fully capture the evolving nature of spam, and are thus not effective enough in near-duplicate detection. In this paper, we propose a novel e-mail abstraction scheme, which uses the e-mail layout structure to represent e-mails. We present a procedure to generate the e-mail abstraction from the HTML content of an e-mail, and this newly devised abstraction can more effectively capture the near-duplicate phenomenon of spam. Moreover, we design a complete spam detection system, Cosdes (standing for COllaborative Spam DEtection System), which possesses an efficient near-duplicate matching scheme and a progressive update scheme. The progressive update scheme enables Cosdes to keep the most up-to-date information for near-duplicate detection. We evaluate Cosdes on a live data set collected from a real e-mail server and show that our system outperforms prior approaches in detection results and is applicable to the real world.
  • Keywords
    groupware; hypermedia markup languages; information filtering; pattern matching; security of data; unsolicited e-mail; Cosdes; HTML content; collaborative spam detection; collaborative spam filtering; e-mail abstraction scheme; e-mail content text; e-mail layout structure; e-mail server; e-mail spam; known spam database; near-duplicate similarity matching scheme; storage utilization; succinct abstraction; user feedback; Spam detection; e-mail abstraction; near-duplicate matching
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type

    jour

  • DOI
    10.1109/TKDE.2010.147
  • Filename
    5560651
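
The abstract describes representing an e-mail by its layout structure rather than its content text. The following is a minimal sketch of that general idea, not the procedure from the paper: it assumes the abstraction is simply the ordered sequence of HTML opening tags, and that two e-mails count as near-duplicates when those sequences are identical. The names `TagSequenceParser`, `layout_abstraction`, and `is_near_duplicate` are hypothetical and introduced only for illustration.

```python
from html.parser import HTMLParser


class TagSequenceParser(HTMLParser):
    """Collects opening tag names in document order, ignoring text content."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Attributes and text are deliberately dropped; only layout is kept.
        self.tags.append(tag)


def layout_abstraction(html_body):
    """Hypothetical helper: reduce an HTML e-mail body to its tag sequence."""
    parser = TagSequenceParser()
    parser.feed(html_body)
    parser.close()
    return tuple(parser.tags)


def is_near_duplicate(html_a, html_b):
    """Simplifying assumption: identical tag sequences mean near-duplicate."""
    return layout_abstraction(html_a) == layout_abstraction(html_b)


# Two messages with obfuscated wording but the same layout, plus one
# unrelated message with a different layout (all made-up examples).
spam_1 = "<html><body><p>Cheap pills!</p><a href='http://a.example'>Buy</a></body></html>"
spam_2 = "<html><body><p>Ch3ap p1lls!!</p><a href='http://b.example'>Order</a></body></html>"
ham = "<html><body><h1>Minutes</h1><ul><li>Item</li></ul></body></html>"

print(is_near_duplicate(spam_1, spam_2))
print(is_near_duplicate(spam_1, ham))
```

Under these assumptions, rewording the text or changing the link target leaves the abstraction unchanged, which is the property the abstract attributes to a layout-based representation. A real system would also need the matching and update schemes the abstract mentions, which this sketch does not attempt.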