• DocumentCode
    3423460
  • Title

    An evaluation of lightweight classification methods for identifying malicious URLs

  • Author

    Egan, S. ; Irwin, Barry

  • Author_Institution
    Dept. of Comput. Sci., Rhodes Univ., Grahamstown, South Africa
  • fYear
    2011
  • fDate
    15-17 Aug. 2011
  • Firstpage
    1
  • Lastpage
    6
  • Abstract
    Recent research has shown that it is possible to identify malicious URLs through lexical analysis of their URL structures alone. This paper explores the effectiveness of these lightweight classification algorithms when working with large real-world datasets, including lists of malicious URLs obtained from Phishtank as well as largely filtered benign URLs obtained from proxy traffic logs. Lightweight algorithms are defined as methods of analysing URLs that do not use external sources of information such as WHOIS lookups, blacklist lookups and content analysis. The lexical parameters they rely on include URL length, number of delimiters and the number of traversals through the directory structure, and are used throughout much of the research in the paradigm of lightweight classification. Methods which include external sources of information are often called fully featured classifications and have been shown to be only slightly more effective than a purely lexical analysis when considering both false-positives and false-negatives. Avoiding external lookups allows these algorithms to be run client side without introducing additional latency, while still providing a high level of accuracy through the use of modern techniques in training classifiers. Analysis of this type will also be useful in incident response analysis, where large numbers of URLs need to be filtered for potentially malicious URLs as an initial step in information gathering, as well as in end-user implementations such as browser extensions, which could help protect the user from following potentially malicious links. Both AROW and CW classifier update methods will be used as prototype implementations and their effectiveness will be compared to fully featured analysis results. These methods are interesting because they are able to train on any labelled data, including instances in which their prediction is correct, allowing them to build confidence in specific lexical features. This makes it possible for them to be trained using noisy input data, making them ideal for real-world applications such as link filtering and information gathering.
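    As an illustration of the approach the abstract describes, the following is a minimal sketch, not the authors' implementation. It extracts the three lexical features named above and feeds them to a diagonal-covariance AROW learner. The delimiter set, the reading of "traversals" as path depth, the bias term, the regularisation parameter `r`, and all function and class names are assumptions made for this example.

```python
from urllib.parse import urlsplit

# Assumed delimiter set; the abstract does not list the characters counted.
DELIMITERS = set("./?=-_&")


def lexical_features(url):
    """Return [bias, URL length, delimiter count, path depth] for a URL."""
    path = urlsplit(url).path
    # Assumption: "traversals" is taken to mean non-empty path segments.
    depth = len([segment for segment in path.split("/") if segment])
    delimiters = sum(1 for ch in url if ch in DELIMITERS)
    return [1.0, float(len(url)), float(delimiters), float(depth)]


class DiagonalAROW:
    """AROW with a diagonal covariance; labels are +1 (malicious) or -1."""

    def __init__(self, dim, r=1.0):
        self.mu = [0.0] * dim      # mean weight vector
        self.sigma = [1.0] * dim   # per-feature variance (inverse confidence)
        self.r = r                 # assumed regularisation parameter

    def margin(self, x):
        return sum(m * xi for m, xi in zip(self.mu, x))

    def update(self, x, y):
        # Hinge loss is positive whenever the signed margin is below 1,
        # so correctly classified but low-margin examples still update.
        loss = max(0.0, 1.0 - y * self.margin(x))
        if loss == 0.0:
            return
        variance = sum(s * xi * xi for s, xi in zip(self.sigma, x))
        beta = 1.0 / (variance + self.r)
        alpha = loss * beta
        for i, xi in enumerate(x):
            self.mu[i] += alpha * y * self.sigma[i] * xi
            self.sigma[i] -= beta * (self.sigma[i] * xi) ** 2
```

    In use, each labelled URL would be passed through `lexical_features` and then to `update`; the sign of `margin` gives the prediction. Features with shrinking `sigma` are those the learner has grown more confident about. In practice the raw length and count features would likely need scaling, which this sketch omits.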
  • Keywords
    Internet; information analysis; pattern classification; security of data; CW classifier update methods; Phishtank; URL length; WHOIS lookups; blacklist lookups; content analysis; directory structure; incident response analysis; information gathering; lexical analysis; lightweight classification algorithms; link filtering; malicious URL identification; proxy traffic logs; Algorithm design and analysis; Biological neural networks; Browsers; Electronic mail; IP networks; Neurons; Training data; Content filtering; Heuristics; Malware; Phishing; URL classification;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Information Security South Africa (ISSA), 2011
  • Conference_Location
    Johannesburg
  • Print_ISBN
    978-1-4577-1481-8
  • Type

    conf

  • DOI
    10.1109/ISSA.2011.6027532
  • Filename
    6027532