Title :
A distributed vertical crawler using crawling-period based strategy
Author :
Zhou, Bing ; Xiao, Bo ; Lin, Zhiqing ; Zhang, Chuang
Author_Institution :
Pattern Recognition & Intell. Syst., Beijing Univ. of Posts & Telecommun., Beijing, China
Abstract :
Due to the explosive growth of Web pages, centralized crawlers can no longer crawl the Web efficiently. Many distributed crawlers are in wide use; however, none of them is suitable for template-customized vertical crawling. In this paper, we present a distributed template-customized vertical crawler designed specifically for crawling Internet forums. The client-server architecture of the system and the function of each module are described in detail, and the design can easily be extended to other fields. We also propose a crawling-period based distribution strategy, with which the crawler manager can balance the number of crawling tasks against the resources of each crawler, and the crawler can flexibly process Web sites with different update frequencies. In addition, we define a communication protocol between the crawlers and the crawler manager and describe how to solve the duplicated-crawling problem in the distributed system. In the experiment, the performance of the centralized vertical crawler is compared with that of the distributed vertical crawler. Experimental results demonstrate that the parallel operation of all the crawlers in the distributed system can greatly enhance crawling efficiency.
Keywords :
Web sites; client-server systems; protocols; Internet forum; Web page; Web site; centralized vertical crawler; client-server architecture; communication protocol; crawler manager; crawling task; crawling-period based distribution strategy; distributed system; distributed template-customized vertical crawler; parallel operation; Crawlers; Discussion forums; Information analysis; Intelligent structures; Intelligent systems; Pattern recognition; Queueing analysis; Search engines; Uniform resource locators; Web pages; Crawling-Period; forum crawler; vertical crawler
Conference_Title :
Future Computer and Communication (ICFCC), 2010 2nd International Conference on
Conference_Location :
Wuhan
Print_ISBN :
978-1-4244-5821-9
DOI :
10.1109/ICFCC.2010.5497780