DocumentCode :
1708734
Title :
UniCrawl: A Practical Geographically Distributed Web Crawler
Author :
Do Le Quoc ; Fetzer, Christof ; Felber, Pascal ; Riviere, Etienne ; Schiavoni, Valerio ; Sutra, Pierre
Author_Institution :
Syst. Eng. Group, Dresden Univ. of Technol., Dresden, Germany
fYear :
2015
Firstpage :
389
Lastpage :
396
Abstract :
As the wealth of information available on the web keeps growing, being able to harvest massive amounts of data has become a major challenge. Web crawlers are the core components to retrieve such vast collections of publicly available data. The key limiting factor of any crawler architecture is, however, its large infrastructure cost. To reduce this cost, and in particular the high upfront investments, we present in this paper a geo-distributed crawler solution, UniCrawl. UniCrawl orchestrates several geographically distributed sites. Each site operates an independent crawler and relies on well-established techniques for fetching and parsing the content of the web. UniCrawl splits the crawled domain space across the sites and federates their storage and computing resources, while minimizing the inter-site communication cost. To assess our design choices, we evaluate UniCrawl in a controlled environment using the ClueWeb12 dataset, and in the wild when deployed over several remote locations. We conducted several experiments over 3 sites spread across Germany. When compared to a centralized architecture with a crawler simply stretched over several locations, UniCrawl shows a performance improvement of 93.6% in terms of network bandwidth consumption, and a speedup factor of 1.75.
Keywords :
Internet; Web sites; information retrieval; ClueWeb12 dataset; Germany; UniCrawl; Web content fetching; Web content parsing; Web sites; centralized architecture; computing resources; crawler architecture; data retrieval; geo-distributed crawler solution; geographically distributed Web crawler; intersite communication cost; network bandwidth consumption; storage; Computer architecture; Crawlers; Distributed databases; Internet; Uniform resource locators; Web pages; cloud federation; geo-distributed system; map-reduce; storage; web crawler;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Cloud Computing (CLOUD), 2015 IEEE 8th International Conference on
Conference_Location :
New York City, NY
Print_ISBN :
978-1-4673-7286-2
Type :
conf
DOI :
10.1109/CLOUD.2015.59
Filename :
7214069