DocumentCode
2107499
Title
Aiding web crawlers; projecting web page last modification
Author
Anjum, Ashiq ; Anjum, Ashiq
Author_Institution
Ecole Polytech., Univ. of Nantes, Nantes, France
fYear
2012
fDate
13-15 Dec. 2012
Firstpage
245
Lastpage
252
Abstract
Due to the colossal amount of data on the Web, Web archivists typically use Web crawlers for automated collection. The Internet Archive is the largest organization that uses a crawling approach to maintain an archive of the entire Web. The most important requirement of a Web crawler, especially when it is used for Web archiving, is to be aware of the date (and time) of last modification of a Web page. This awareness has various advantages, the most important of which are i) presenting an up-to-date version of a Web page to the end user and ii) easier adjustment of the crawl rate, which allows future retrieval of a Web page's version at a given date or computation of its refresh rate. The typical way to obtain this modification information, namely the Last-Modified: HTTP header, unfortunately does not always provide correct information. In this work, we discuss, with the help of experiments, various techniques that can be used to determine the date of last modification of a Web page. This helps in adjusting the crawl rate for a specific page and in presenting users with up-to-date information, thus making future versioning of a Web page more meticulous.
Keywords
Internet; information retrieval; HTTP header; Internet archive; Web archiving; Web crawler; Web page last modification projection; Web page retrieval; Web page versioning; crawling approach; hypertext transfer protocol; refresh rate; HTTP response headers; Web Archive; Web Crawlers; World Wide Web;
fLanguage
English
Publisher
IEEE
Conference_Titel
Multitopic Conference (INMIC), 2012 15th International
Conference_Location
Islamabad
Print_ISBN
978-1-4673-2249-2
Type
conf
DOI
10.1109/INMIC.2012.6511443
Filename
6511443