• DocumentCode
    2107499
  • Title

    Aiding web crawlers; projecting web page last modification

  • Author

    Anjum, Ashiq ; Anjum, Ashiq

  • Author_Institution
    Ecole Polytech., Univ. of Nantes, Nantes, France
  • fYear
    2012
  • fDate
    13-15 Dec. 2012
  • Firstpage
    245
  • Lastpage
    252
  • Abstract
    Due to the colossal amount of data on the Web, Web archivists typically use Web crawlers for automated collection. The Internet Archive is the largest organization that relies on a crawling approach to maintain an archive of the entire Web. The most important requirement of a Web crawler, especially when it is used for Web archiving, is to know the date (and time) of the last modification of a Web page. Knowing this has several advantages, the most important being i) presenting an up-to-date version of a Web page to the end user, and ii) making it easier to adjust the crawl rate, which allows future retrieval of the version of a Web page at a given date or computation of its refresh rate. The typical way to obtain this modification information, namely the Last-Modified: HTTP header, unfortunately does not always provide correct information. In this work, we discuss, with the help of experiments, various techniques that can be used to determine the date of last modification of a Web page. This helps in adjusting the crawl rate for a specific page, in presenting users with up-to-date information, and thus in making future versioning of a Web page more meticulous.
  • Keywords
    Internet; information retrieval; HTTP header; Internet archive; Web archiving; Web crawler; Web page last modification projection; Web page retrieval; Web page versioning; crawling approach; hypertext transfer protocol; refresh rate; HTTP response headers; Web Archive; Web Crawlers; World Wide Web;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Multitopic Conference (INMIC), 2012 15th International
  • Conference_Location
    Islamabad
  • Print_ISBN
    978-1-4673-2249-2
  • Type

    conf

  • DOI
    10.1109/INMIC.2012.6511443
  • Filename
    6511443