Malicious URL filtering — A big data application

Author

Min-Sheng Lin ; Chien-Yi Chiu ; Yuh-Jye Lee ; Hsing-Kuo Pao

Author_Institution

Dept. of Comput. Sci. & Inf. Eng., Nat. Taiwan Univ. of Sci. & Technol., Taipei, Taiwan

fYear

2013

fDate

6-9 Oct. 2013

Firstpage

589

Lastpage

596

Abstract

Malicious URLs have become a channel for Internet criminal activities such as drive-by download, spamming and phishing. Applications for detecting malicious URLs are accurate but slow, because they need to download the page content or query Internet host information. In this paper we present a novel lightweight filter, based only on the URL string itself, to be applied before existing processing methods. Experiments on a large dataset demonstrate a 75% reduction in workload size while retaining at least 90% of malicious URLs. Existing methods do not scale well to the hundreds of millions of URLs encountered every day, because the task is a heavily imbalanced, large-scale binary classification problem. Our proposed method handles nearly two million URLs in less than five minutes. We generate two filtering models using lexical features and descriptive features, and then combine the filtering results. On-line learning algorithms are applied not only to deal with large-scale data sets but also to fit the very short lifetime of malicious URLs. Our filter can significantly reduce the volume of URL queries on which further analysis needs to be performed, saving both computing time and the bandwidth used for content retrieval.
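
The idea of a URL-string-only filter trained on-line can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the learning algorithm or the exact features, so a plain perceptron over bag-of-token lexical features stands in here. The names `lexical_tokens`, `OnlinePerceptron`, `keep_for_analysis`, the threshold value and the toy URLs are all hypothetical.

```python
import re


def lexical_tokens(url):
    """Split a URL string into lowercase tokens on non-alphanumeric characters."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", url.lower()) if t]


class OnlinePerceptron:
    """Stand-in on-line learner (assumption): a mistake-driven linear model."""

    def __init__(self):
        self.weights = {}  # token -> weight
        self.bias = 0.0

    def score(self, url):
        # Each distinct token contributes its weight once (binary features).
        tokens = set(lexical_tokens(url))
        return self.bias + sum(self.weights.get(t, 0.0) for t in tokens)

    def update(self, url, label):
        """label is +1 (malicious) or -1 (benign); weights change only on a mistake."""
        if label * self.score(url) <= 0:
            for t in set(lexical_tokens(url)):
                self.weights[t] = self.weights.get(t, 0.0) + label
            self.bias += label


def keep_for_analysis(model, url, threshold=0.0):
    """Pass a URL on to the slower detector only if its score exceeds the threshold.

    Lowering the threshold keeps more malicious URLs at the cost of a larger workload.
    """
    return model.score(url) > threshold


# Toy labeled stream (hypothetical data), processed one URL at a time.
stream = [
    ("http://example.com/index.html", -1),
    ("http://paypal-login.example.net/verify/account.php", +1),
    ("http://example.org/news/today", -1),
    ("http://secure-update.example.net/login/verify.php", +1),
]

model = OnlinePerceptron()
for url, label in stream:
    model.update(url, label)

for candidate in ("http://login.example.net/verify.php",
                  "http://example.org/news/index.html"):
    print(candidate, keep_for_analysis(model, candidate))
```

Because the model is updated one example at a time and only reads the URL string, it needs neither page downloads nor host lookups, which is the property the abstract relies on for speed. A real filter would also need the descriptive-feature model and the result-combination step, which this sketch omits.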

Keywords

Internet; computer crime; learning (artificial intelligence); pattern classification; query processing; Internet criminal activities; URL queries; URL string; big data application; content retrieval; drive-by-download; heavily-imbalanced large-scale binary classification problem; lifetime characteristics; lightweight filter; malicious URL filtering; on-line learning algorithms; phishing; spamming; Dictionaries; Feature extraction; IP networks; Prediction algorithms; Predictive models; Training; Web sites; Data Mining; Information Filtering; Information Security; Machine learning;

fLanguage

English

Publisher

IEEE

Conference_Titel

Big Data, 2013 IEEE International Conference on

Conference_Location

Silicon Valley, CA

Type

conf

DOI

10.1109/BigData.2013.6691627

Filename

6691627