DocumentCode
2832343
Title
A distributed vertical crawler using crawling-period based strategy
Author
Zhou, Bing ; Xiao, Bo ; Lin, Zhiqing ; Zhang, Chuang
Author_Institution
Pattern Recognition & Intell. Syst., Beijing Univ. of Posts & Telecommun., Beijing, China
Volume
1
fYear
2010
fDate
21-24 May 2010
Abstract
Due to the explosive growth of Web pages, centralized crawlers can no longer crawl the Web efficiently. Many distributed crawlers are in wide use; however, none of them is suitable for template-customized vertical crawling. In this paper, we present a distributed template-customized vertical crawler designed specifically for crawling Internet forums. We describe in detail the client-server architecture of the system and the function of every module; the design can be extended easily to other fields. We also propose a crawling-period based distribution strategy, with which the crawler manager can effectively coordinate the quantity of crawling tasks and the resources of each crawler, and the crawler can flexibly process Websites with different update frequencies. We also define a communication protocol between the crawlers and the crawler manager and describe how to solve the duplicated crawling problem in the distributed system. The performance of a centralized vertical crawler and a distributed vertical crawler is compared experimentally. Experimental results demonstrate that the parallel operation of all the crawlers in the distributed system can greatly enhance crawling efficiency.
Keywords
Web sites; client-server systems; protocols; Internet forum; Web page; Web site; centralized vertical crawler; client-server architecture; communication protocol; crawler manager; crawling task; crawling-period based distribution strategy; distributed system; distributed template-customized vertical crawler; parallel operation; Crawlers; Discussion forums; Information analysis; Intelligent structures; Intelligent systems; Pattern recognition; Queueing analysis; Search engines; Uniform resource locators; Web pages; Crawling-Period; crawler manager; distributed system; forum crawler; vertical crawler;
fLanguage
English
Publisher
IEEE
Conference_Titel
Future Computer and Communication (ICFCC), 2010 2nd International Conference on
Conference_Location
Wuhan
Print_ISBN
978-1-4244-5821-9
Type
conf
DOI
10.1109/ICFCC.2010.5497780
Filename
5497780