DocumentCode :
1733960
Title :
Web crawler with URL signature — A performance study
Author :
Soon, Lay-Ki ; Ku, Yee-Ern ; Lee, Sang Ho
Author_Institution :
Fac. of Comput. & Inf., Multimedia Univ., Cyberjaya, Malaysia
fYear :
2012
Firstpage :
127
Lastpage :
130
Abstract :
URL signature was proposed for use in web crawling, with the aim of avoiding the processing of duplicated web pages for further web crawling. In this paper, we present a performance study of an open source web crawler, WebSPHINX, in which we have embedded URL signature. The experimental results indicate that URL signature significantly reduces the processing of duplicated web pages for further web crawling, at a negligible cost compared with the same crawler without URL signature.
Keywords :
Internet; Web sites; hypermedia markup languages; open loop systems; Web crawling; WebSPHINX; Website-Specific Processors for HTML INformation eXtraction; duplicated Web page processing; embedded URL signature; open source Web crawler; Crawlers; Data mining; Educational institutions; HTML; Standards; Uniform resource locators; Web pages; URL normalization; URL signature; web crawling;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Mining and Optimization (DMO), 2012 4th Conference on
Conference_Location :
Langkawi
Print_ISBN :
978-1-4673-2717-6
Type :
conf
DOI :
10.1109/DMO.2012.6329810
Filename :
6329810