Title :
Google Penguin: Evasion in Non-English Languages and a New Classifier
Author :
Alarifi, Abdulrahman ; Alsaleh, Mansour ; Al-Salman, AbdulMalik ; Alswayed, Abdulmajeed ; Alkhaledi, Ahmad
Author_Institution :
Comput. Res. Inst., King Abdulaziz City for Sci. & Technol., Riyadh, Saudi Arabia
Abstract :
Web spam techniques aim to mislead search engines so that spam pages are ranked higher than they deserve. The result is misleading search results: spam pages can appear even though their content is unrelated to the search terms. Despite the efforts of search engines to detect spam pages and filter them out of their search results, spammers continue to develop new tactics to evade search engines' detection mechanisms. In this paper, we study the effectiveness and accuracy of newly developed anti-spamming techniques in the Google search engine. Focusing on Arabic spam pages, our results show that Google's anti-spamming techniques are ineffective against spam pages with Arabic content. We explore various types of web spam detection features to obtain a set of features that yields reasonable detection accuracy. To build and evaluate our classifier, we collect and manually label a dataset of Arabic web pages, including both benign and spam pages. We believe this Arabic web spam corpus will help researchers conduct sound measurement studies. We also develop a browser plug-in that uses our classifier to warn the user about web spam pages before the user accesses them, upon clicking on a search term. The plug-in can also filter search engine results.
Keywords :
Internet; natural language processing; search engines; unsolicited e-mail; Arabic Web spam corpus; Arabic spam pages; Google Penguin; Google search engine; Web spam detection; Web spam pages; Web spam techniques; new classifier; non-English languages; search engines detection mechanisms; search terms; Browsers; Feature extraction; Google; Market research; Search engines; Unsolicited electronic mail; Web pages; Content spam; Link spam; Search engine spam; Spamdexing; Web spam
Conference_Title :
Machine Learning and Applications (ICMLA), 2013 12th International Conference on
Conference_Location :
Miami, FL
DOI :
10.1109/ICMLA.2013.135