  • DocumentCode
    3230326
  • Title
    A Memory-Efficient Strategy for Exploring the Web
  • Author
    Castillo, Carlos; Nelli, Alberto; Panconesi, Alessandro
  • Author_Institution
    Univ. di Roma "La Sapienza", Rome
  • fYear
    2006
  • fDate
    18-22 Dec. 2006
  • Firstpage
    680
  • Lastpage
    686
  • Abstract
    Search engines rely on Web crawlers to create an index of the Web. Web crawlers explore the Web, downloading pages and finding links to new pages to be explored. At any given moment, there are a number of pages waiting to be downloaded in the crawler queue. We study the growth of this queue of pending pages during a crawl of a large subset of the Web. In a normal breadth-first crawler, the queue quickly grows very large. We present a strategy for managing the pending queue that reduces its maximum size by 50% while preserving the coverage and quality of the pages visited. This can be applied to general purpose Web crawlers as well as topic-specific crawling, peer-to-peer search, on-demand Web crawling, and other environments in which memory usage has to be kept to a minimum.
  • Keywords
    Internet; search engines; Web Search engines; Web crawlers; Web downloading pages; Web index; crawler queue; memory usage; memory-efficient strategy; normal breadth-first crawler; on-demand Web crawling; peer-to-peer search; pending queue management; topic-specific crawling; Crawlers; Large-scale systems; Peer to peer computing; Quality management; Remuneration; Search engines; Visualization; Web pages; Web search; Web server
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Web Intelligence, 2006. WI 2006. IEEE/WIC/ACM International Conference on
  • Conference_Location
    Hong Kong
  • Print_ISBN
    0-7695-2747-7
  • Type
    conf
  • DOI
    10.1109/WI.2006.18
  • Filename
    4061453