• DocumentCode
    2360086
  • Title

    Cooperative crawling

  • Author

    Buzzi, Marina

  • Author_Institution
    IIT-CNR, Italy
  • fYear
    2003
  • fDate
    10-12 Nov. 2003
  • Firstpage
    209
  • Lastpage
    211
  • Abstract
    Web crawler design presents many challenges, including architecture, strategies and performance. One of the most important research topics is improving the selection of Web pages that are "interesting" to the user, according to importance metrics. Another relevant issue is content freshness, i.e. keeping temporarily stored copies fresh and consistent. To this end, the crawler periodically revisits the stored content (the recrawling process). We propose a scheme that lets a crawler acquire information about the global state of a Website before the crawling process takes place. The scheme requires the Web server to cooperate by collecting and publishing information about its content, which the crawler can use to tune its visit strategy. If this information is unavailable or out of date, the crawler still acts in the usual manner. In this sense, the proposed scheme is non-invasive and independent of any crawling strategy or architecture.
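    The fallback behaviour described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function name plan_recrawl, the summary format (a mapping from URL to last-modified time) and the max_summary_age threshold are all assumptions made for illustration, since the abstract does not specify how the server publishes its information.

```python
from typing import Dict, List, Optional


def plan_recrawl(
    stored: Dict[str, float],
    summary: Optional[Dict[str, float]],
    summary_age: float = 0.0,
    max_summary_age: float = 86400.0,
) -> List[str]:
    """Return the URLs to fetch on the next visit to one site.

    stored:  URL -> time the crawler last fetched its copy (hypothetical).
    summary: URL -> last-modified time published by the server, or None
             when the site publishes nothing (hypothetical format).
    summary_age: seconds since the server refreshed the summary (assumed).
    """
    # No summary, or one that is too old: act in the usual manner
    # and revisit every stored page.
    if summary is None or summary_age > max_summary_age:
        return sorted(stored)

    to_fetch = []
    for url, modified in summary.items():
        fetched = stored.get(url)
        # Fetch pages the crawler has never seen or that changed
        # after the stored copy was taken.
        if fetched is None or modified > fetched:
            to_fetch.append(url)
    return sorted(to_fetch)
```

    With a usable summary the crawler fetches only new or changed pages; without one it falls back to a full revisit, so a site that does not cooperate is crawled as before.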
  • Keywords
    Internet; Web sites; search engines; Web crawler design; Web pages; Web server; Website; cooperative crawling; information publishing; search engine; Context modeling; Crawlers; Frequency; Inspection; Service oriented architecture; Uniform resource locators; World Wide Web
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Web Congress, 2003. Proceedings. First Latin American
  • Print_ISBN
    0-7695-2058-8
  • Type

    conf

  • DOI
    10.1109/LAWEB.2003.1250300
  • Filename
    1250300