Title :
Web crawler with URL signature — A performance study
Author :
Soon, Lay-Ki ; Ku, Yee-Ern ; Lee, Sang Ho
Author_Institution :
Fac. of Comput. & Inf., Multimedia Univ., Cyberjaya, Malaysia
Abstract :
URL signature was proposed for web crawling as a way to avoid processing duplicated web pages during further crawling. In this paper, we present a performance study of an open source web crawler, WebSPHINX, in which we have embedded URL signature. The experimental results indicate that URL signature significantly reduces the processing of duplicated web pages during further crawling, at a negligible cost compared with the same crawler without URL signature.
Keywords :
Internet; Web sites; hypermedia markup languages; open loop systems; Web crawling; WebSPHINX; Website-Specific Processors for HTML INformation eXtraction; duplicated Web page processing; embedded URL signature; open source Web crawler; Crawlers; Data mining; Educational institutions; HTML; Standards; Uniform resource locators; Web pages; URL normalization; URL signature; web crawling;
Conference_Title :
Data Mining and Optimization (DMO), 2012 4th Conference on
Conference_Location :
Langkawi
Print_ISBN :
978-1-4673-2717-6
DOI :
10.1109/DMO.2012.6329810