• DocumentCode
    1667974
  • Title
    Scheduling algorithms for Web crawling
  • Author
    Castillo, Carlos ; Marin, Mauricio ; Rodriguez, Andrea ; Baeza-Yates, Ricardo
  • Author_Institution
    Center for Web Res., Chile Univ., Chile
  • fYear
    2004
  • Firstpage
    10
  • Lastpage
    17
  • Abstract
    This paper presents a comparative study of strategies for Web crawling. We show that a combination of breadth-first ordering with the largest sites first is a practical alternative, since it is fast, simple to implement, and able to retrieve the best-ranked pages at a rate that is closer to the optimal than other alternatives. Our study was performed on a large sample of the Chilean Web, which was crawled using simulators, so that all strategies were compared under the same conditions, and using actual crawls to validate our conclusions. We also explored the effects of large-scale parallelism in the page retrieval task and of multiple-page requests in a single connection for effective amortization of latency times.
  • Keywords
    Internet; information retrieval; scheduling; Chilean Web; Web crawling; Web page ranking; breadth-first ordering; multiple-page requests; page retrieval; scheduling algorithms; Crawlers; Delay; Indexing; Large-scale systems; Scheduling algorithm; Search engines; Service oriented architecture; Testing; Web pages; Web search;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    WebMedia and LA-Web, 2004. Proceedings
  • Print_ISBN
    0-7695-2237-8
  • Type
    conf
  • DOI
    10.1109/WEBMED.2004.1348139
  • Filename
    1348139