DocumentCode
3230326
Title
A Memory-Efficient Strategy for Exploring the Web
Author
Castillo, Carlos ; Nelli, Alberto ; Panconesi, Alessandro
Author_Institution
Univ. di Roma "La Sapienza", Rome
fYear
2006
fDate
18-22 Dec. 2006
Firstpage
680
Lastpage
686
Abstract
Search engines rely on Web crawlers to create an index of the Web. Web crawlers explore the Web by downloading pages and finding links to new pages to explore. At any given moment, a number of pages are waiting in the crawler queue to be downloaded. We study the growth of this queue of pending pages during a crawl of a large subset of the Web. In a normal breadth-first crawler, the queue quickly grows very large. We present a strategy for managing the pending queue that reduces its maximum size by 50% while preserving the coverage and quality of the pages visited. This strategy can be applied to general-purpose Web crawlers as well as to topic-specific crawling, peer-to-peer search, on-demand Web crawling, and other environments in which memory usage has to be kept to a minimum.
Keywords
Internet; search engines; Web Search engines; Web crawlers; Web downloading pages; Web index; crawler queue; memory usage; memory-efficient strategy; normal breadth-first crawler; on-demand Web crawling; peer-to-peer search; pending queue management; topic-specific crawling; Crawlers; Large-scale systems; Peer to peer computing; Quality management; Remuneration; Search engines; Visualization; Web pages; Web search; Web server
fLanguage
English
Publisher
IEEE
Conference_Titel
Web Intelligence, 2006. WI 2006. IEEE/WIC/ACM International Conference on
Conference_Location
Hong Kong
Print_ISBN
0-7695-2747-7
Type
conf
DOI
10.1109/WI.2006.18
Filename
4061453