DocumentCode :
659478
Title :
Malicious URL filtering — A big data application
Author :
Min-Sheng Lin ; Chien-Yi Chiu ; Yuh-Jye Lee ; Hsing-Kuo Pao
Author_Institution :
Dept. of Comput. Sci. & Inf. Eng., Nat. Taiwan Univ. of Sci. & Technol., Taipei, Taiwan
fYear :
2013
fDate :
6-9 Oct. 2013
Firstpage :
589
Lastpage :
596
Abstract :
Malicious URLs have become a channel for Internet criminal activities such as drive-by-download, spamming and phishing. Applications for the detection of malicious URLs are accurate but slow (because they need to download the content or query some Internet host information). In this paper we present a novel lightweight filter based only on the URL string itself to use before existing processing methods. We run experiments on a large dataset and demonstrate a 75% reduction in workload size while retaining at least 90% of malicious URLs. Existing methods do not scale well with the hundreds of millions of URLs encountered every day as the problem is a heavily-imbalanced, large-scale binary classification problem. Our proposed method is able to handle nearly two million URLs in less than five minutes. We generate two filtering models by using lexical features and descriptive features, and then combine the filtering results. The on-line learning algorithms are applied here not only for dealing with large-scale data sets but also for fitting the very short lifetime characteristics of malicious URLs. Our filter can significantly reduce the volume of URL queries on which further analysis needs to be performed, saving both computing time and bandwidth used for content retrieval.
Keywords :
Internet; computer crime; learning (artificial intelligence); pattern classification; query processing; Internet criminal activities; URL queries; URL string; big data application; content retrieval; drive-by-download; heavily-imbalanced large-scale binary classification problem; lifetime characteristics; lightweight filter; malicious URL filtering; on-line learning algorithms; phishing; spamming; Dictionaries; Feature extraction; IP networks; Prediction algorithms; Predictive models; Training; Web sites; Data Mining; Information Filtering; Information Security; Machine learning;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Big Data, 2013 IEEE International Conference on
Conference_Location :
Silicon Valley, CA
Type :
conf
DOI :
10.1109/BigData.2013.6691627
Filename :
6691627