• DocumentCode
    2404532
  • Title

    Design and implementation of a high-performance distributed Web crawler

  • Author

    Shkapenyuk, Vladislav ; Suel, Torsten

  • Author_Institution
    CIS Dept., Polytech. Univ. Brooklyn, NY, USA
  • fYear
    2002
  • fDate
    2002
  • Firstpage
    357
  • Lastpage
    368
  • Abstract
    Broad Web search engines as well as many more specialized search tools rely on Web crawlers to acquire large collections of pages for indexing and analysis. Such a Web crawler may interact with millions of hosts over a period of weeks or months, and thus issues of robustness, flexibility, and manageability are of major importance. In addition, I/O performance, network resources, and OS limits must be taken into account in order to achieve high performance at a reasonable cost. In this paper, we describe the design and implementation of a distributed Web crawler that runs on a network of workstations. The crawler scales to (at least) several hundred pages per second, is resilient against system crashes and other events, and can be adapted to various crawling applications. We present the software architecture of the system, discuss the performance bottlenecks, and describe efficient techniques for achieving high performance. We also report preliminary experimental results based on a crawl of 120 million pages on 5 million hosts.
  • Keywords
    Internet; hypermedia; information retrieval; search engines; software architecture; workstation clusters; I/O performance; OS limits; analysis; broad Web search engines; flexibility; high-performance distributed Web crawler; indexing; manageability; network of workstations; network resources; performance bottlenecks; robustness; software architecture; specialized search tools; Application software; Computer crashes; Costs; Crawlers; Indexing; Robustness; Search engines; Software architecture; Web search; Workstations;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Engineering, 2002. Proceedings. 18th International Conference on
  • Conference_Location
    San Jose, CA
  • ISSN
    1063-6382
  • Print_ISBN
    0-7695-1531-2
  • Type

    conf

  • DOI
    10.1109/ICDE.2002.994750
  • Filename
    994750