DocumentCode
1806568
Title
Around the web in six weeks: Documenting a large-scale crawl
Author
Ahmed, Sarker Tanzir ; Sparkman, Clint ; Lee, Hsin-Tsang ; Loguinov, Dmitri
Author_Institution
Dept. of Comput. Sci. & Eng., Texas A&M Univ., College Station, TX, USA
fYear
2015
fDate
April 26 2015-May 1 2015
Firstpage
1598
Lastpage
1606
Abstract
Exponential growth of the web continues to present challenges to the design and scalability of web crawlers. Our previous work on a high-performance platform called IRLbot [28] led to the development of new algorithms for realtime URL manipulation, domain ranking, and budgeting, which were tested in a 6.3B-page crawl. Since very little is known about the crawl itself, our goal in this paper is to undertake an extensive measurement study of the collected dataset and document its crawl dynamics. We also propose a framework for modeling the scaling rate of various data structures as crawl size goes to infinity and offer a methodology for comparing crawl coverage to that of commercial search engines.
Keywords
information retrieval; search engines; IRLbot platform; URL manipulation; Web crawlers; budgeting; crawl coverage; crawl dynamics; crawl size; data structures; domain ranking; large-scale crawl documentation; Admission control; Bandwidth; Crawlers; HTML; Robots; Servers; Uniform resource locators
fLanguage
English
Publisher
IEEE
Conference_Titel
Computer Communications (INFOCOM), 2015 IEEE Conference on
Conference_Location
Kowloon
Type
conf
DOI
10.1109/INFOCOM.2015.7218539
Filename
7218539