• DocumentCode
    659478
  • Title

    Malicious URL filtering — A big data application

  • Author

    Min-Sheng Lin ; Chien-Yi Chiu ; Yuh-Jye Lee ; Hsing-Kuo Pao

  • Author_Institution
    Dept. of Comput. Sci. & Inf. Eng., Nat. Taiwan Univ. of Sci. & Technol., Taipei, Taiwan
  • fYear
    2013
  • fDate
    6-9 Oct. 2013
  • Firstpage
    589
  • Lastpage
    596
  • Abstract
    Malicious URLs have become a channel for Internet criminal activities such as drive-by-download, spamming and phishing. Applications for the detection of malicious URLs are accurate but slow (because they need to download the content or query some Internet host information). In this paper we present a novel lightweight filter based only on the URL string itself to use before existing processing methods. We run experiments on a large dataset and demonstrate a 75% reduction in workload size while retaining at least 90% of malicious URLs. Existing methods do not scale well with the hundreds of millions of URLs encountered every day as the problem is a heavily-imbalanced, large-scale binary classification problem. Our proposed method is able to handle nearly two million URLs in less than five minutes. We generate two filtering models by using lexical features and descriptive features, and then combine the filtering results. The on-line learning algorithms are applied here not only for dealing with large-scale data sets but also for fitting the very short lifetime characteristics of malicious URLs. Our filter can significantly reduce the volume of URL queries on which further analysis needs to be performed, saving both computing time and bandwidth used for content retrieval.
  • Keywords
    Internet; computer crime; learning (artificial intelligence); pattern classification; query processing; Internet criminal activities; URL queries; URL string; big data application; content retrieval; drive-by-download; heavily-imbalanced large-scale binary classification problem; lifetime characteristics; lightweight filter; malicious URL filtering; on-line learning algorithms; phishing; spamming; Dictionaries; Feature extraction; IP networks; Prediction algorithms; Predictive models; Training; Web sites; Data Mining; Information Filtering; Information Security; Machine learning
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Big Data, 2013 IEEE International Conference on
  • Conference_Location
    Silicon Valley, CA
  • Type

    conf

  • DOI
    10.1109/BigData.2013.6691627
  • Filename
    6691627
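
The abstract describes a filter that looks only at the URL string, trains one model on lexical features and one on descriptive features with on-line learning, and then combines the two results. The following is a minimal sketch of that general idea, not the authors' implementation. The abstract does not name the learning algorithm, the exact features, or the combination rule, so the perceptron, the four descriptive statistics, the union rule, and the two example URLs are all assumptions made for illustration.

```python
import re


def lexical_features(url):
    """Binary bag-of-words over tokens of the URL string (assumed tokenization)."""
    tokens = re.split(r"[^A-Za-z0-9]+", url.lower())
    return {"tok=" + t: 1.0 for t in tokens if t}


def descriptive_features(url):
    """A few string statistics; these particular ones are illustrative only."""
    return {
        "bias": 1.0,
        "length": len(url) / 100.0,
        "digits": sum(c.isdigit() for c in url) / 10.0,
        "dots": url.count(".") / 10.0,
    }


class OnlinePerceptron:
    """Mistake-driven linear model, updated one example at a time.

    Stands in for the unspecified on-line learner mentioned in the abstract.
    """

    def __init__(self):
        self.w = {}

    def score(self, feats):
        return sum(self.w.get(k, 0.0) * v for k, v in feats.items())

    def update(self, feats, label):
        # label is +1 (malicious) or -1 (benign); update only on a mistake.
        if label * self.score(feats) <= 0:
            for k, v in feats.items():
                self.w[k] = self.w.get(k, 0.0) + label * v


def keep_for_analysis(url, lex_model, desc_model, threshold=0.0):
    """Union rule (an assumption): forward the URL if either model flags it."""
    return (lex_model.score(lexical_features(url)) > threshold
            or desc_model.score(descriptive_features(url)) > threshold)


# Toy labelled stream; both URLs are made up for this sketch.
MAL_URL = "http://paypal.com.secure-login.example/verify.php?id=123"
BEN_URL = "http://example.org/index.html"
stream = [(MAL_URL, +1), (BEN_URL, -1)]

lex_model = OnlinePerceptron()
desc_model = OnlinePerceptron()
for _ in range(10):  # a few passes over the toy stream
    for url, label in stream:
        lex_model.update(lexical_features(url), label)
        desc_model.update(descriptive_features(url), label)

print(keep_for_analysis(MAL_URL, lex_model, desc_model))
```

In a setting like the one the abstract describes, only URLs for which `keep_for_analysis` returns true would be passed on to the slower content-based detectors. Raising `threshold` (a hypothetical parameter here) would shrink the forwarded workload, and lowering it would retain more malicious URLs.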