Title :
Phishing website detection using Latent Dirichlet Allocation and AdaBoost
Author :
Ramanathan, Venkatesh ; Wechsler, Harry
Author_Institution :
Dept. of Comput. Sci., George Mason Univ., Fairfax, VA, USA
Abstract :
One way criminals steal identities in cyberspace is phishing. Attackers host phishing websites that resemble legitimate websites and entice users to click on hyperlinks that direct them to these fake sites. Attackers use the fake sites to capture personal information such as logins, passwords, and social security numbers from innocent victims, which they later use to commit crimes. We propose a robust methodology for detecting phishing websites that employs a topic modeling technique, Latent Dirichlet Allocation, for semantic analysis and AdaBoost for classification. The methodology is a content-driven approach that is device independent and language neutral. The website content served to mobile and desktop clients is collected by an intelligent web crawler. Website content that is not in English is translated to English using Google's language translator. A topic model is built from the translated content of desktop and mobile clients. The phishing website classifier is built using (i) the topic distribution probabilities found by Latent Dirichlet Allocation as features and (ii) the AdaBoost voting technique. Experiments were conducted on one of the large public corpora of website data, containing 47,500 phishing websites and 52,500 good websites. Results show that our method achieves an F-measure of 99%.
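The classification stage described in the abstract, AdaBoost voting over per-page topic probabilities, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes the topic proportions have already been produced by an LDA model, it assumes threshold decision stumps as weak learners (the abstract does not say which weak learner was used), and the three-topic feature vectors are made-up values, not data from the paper's corpus. The names `train_adaboost` and `predict` are hypothetical.

```python
import math

def train_adaboost(X, y, rounds=5):
    """Discrete AdaBoost with threshold stumps (assumed weak learner).
    Labels in y are +1 (phishing) or -1 (legitimate)."""
    n = len(X)
    weights = [1.0 / n] * n
    ensemble = []  # each entry: (alpha, feature index, threshold, polarity)
    for _ in range(rounds):
        best = None
        # Exhaustively search for the stump with the lowest weighted error.
        for f in range(len(X[0])):
            for thr in sorted({row[f] for row in X}):
                for pol in (1, -1):
                    preds = [pol if row[f] >= thr else -pol for row in X]
                    err = sum(w for w, p, t in zip(weights, preds, y) if p != t)
                    if best is None or err < best[0]:
                        best = (err, f, thr, pol, preds)
        err, f, thr, pol, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # clamp to avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # vote weight of this stump
        ensemble.append((alpha, f, thr, pol))
        # Increase the weight of misclassified pages, then renormalize.
        weights = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, row):
    """Weighted vote of all stumps; returns +1 (phishing) or -1 (legitimate)."""
    score = sum(a * (pol if row[f] >= thr else -pol) for a, f, thr, pol in ensemble)
    return 1 if score >= 0 else -1

# Illustrative (made-up) topic proportions for six pages over three topics.
X = [
    [0.80, 0.10, 0.10],
    [0.70, 0.20, 0.10],
    [0.65, 0.05, 0.30],
    [0.10, 0.60, 0.30],
    [0.20, 0.30, 0.50],
    [0.15, 0.70, 0.15],
]
y = [1, 1, 1, -1, -1, -1]

model = train_adaboost(X, y, rounds=5)
print([predict(model, row) for row in X])
```

In a fuller pipeline, the rows of `X` would be the per-document topic distributions inferred by the LDA model from the crawled and translated page text.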
Keywords :
Web sites; computer crime; language translation; learning (artificial intelligence); mobile computing; natural language processing; pattern classification; statistical distributions; text analysis; AdaBoost voting technique; English language; F-measure; Google language translator; Website content; Website data; content driven approach; content translation; cyberspace; desktop client; fake Websites; hyperlink clicking; identity stealing; intelligent Web crawler; latent Dirichlet allocation; legitimate Website; login; mobile clients; password; personal information; phishing Website detection; phishing website classification; semantic analysis; social security number; topic distribution probability; topic modeling technique; Crawlers; Feature extraction; Internet; Mobile communication; Mobile handsets; Resource management; Robustness; boosting; detection; identity theft; machine learning; phishing website
Conference_Title :
Intelligence and Security Informatics (ISI), 2012 IEEE International Conference on
Conference_Location :
Arlington, VA
Print_ISBN :
978-1-4673-2105-1
DOI :
10.1109/ISI.2012.6284100