Title :
Aiding web crawlers; projecting web page last modification
Author :
Anjum, Ashiq ; Anjum, Ashiq
Author_Institution :
Ecole Polytech., Univ. of Nantes, Nantes, France
Abstract :
Due to the colossal amount of data on the Web, Web archivists typically rely on Web crawlers for automated collection. The Internet Archive is the largest organization that uses a crawling approach to maintain an archive of the entire Web. The most important requirement of a Web crawler, especially when it is used for Web archiving, is to know the date (and time) of the last modification of a Web page. This knowledge has various advantages, the most important of which are i) presenting an up-to-date version of a Web page to the end user and ii) making it easier to adjust the crawl rate, which allows future retrieval of a Web page's version at a given date or computation of its refresh rate. The typical way to obtain this modification information, namely the Last-Modified HTTP header, unfortunately does not always provide correct information. In this work, we discuss, with the help of experiments, various techniques that can be used to determine the date of last modification of a Web page. This helps in adjusting the crawl rate for a specific page and in presenting users with up-to-date information, thus making future versioning of a Web page more meticulous.
Keywords :
Internet; information retrieval; HTTP header; Internet archive; Web archiving; Web crawler; Web page last modification projection; Web page retrieval; Web page versioning; crawling approach; hypertext transfer protocol; refresh rate; HTTP response headers; Web Archive; Web Crawlers; World Wide Web;
Conference_Title :
Multitopic Conference (INMIC), 2012 15th International
Conference_Location :
Islamabad
Print_ISBN :
978-1-4673-2249-2
DOI :
10.1109/INMIC.2012.6511443