DocumentCode
2909926
Title
A GNP-Based Scheduling Strategy for Distributed Crawling
Author
Liu, Shuang ; Xu, Xiao ; Li, Dong ; Zhang, Wei-Zhe ; Liu, Xin-Ran
Author_Institution
Dept. of Comput. Sci. & Technol., Harbin Inst. of Technol., Harbin, China
fYear
2009
fDate
7-8 Nov. 2009
Firstpage
651
Lastpage
655
Abstract
To address the task scheduling and load balancing problems of distributed search engines, this paper proposes a GNP-based scheduling strategy for distributed crawling together with a load balancing method. An Internet distance estimation mechanism is adopted in place of large-scale network distance measurement, which both improves the system's response speed and reduces the load the system places on the WAN. By deploying crawling nodes across WANs, we built a distributed search engine and implemented several scheduling strategies. The online experiment shows a great improvement in the system's performance.
Keywords
Internet; resource allocation; scheduling; search engines; GNP-based scheduling strategy; Internet distance estimating mechanism; WAN; distributed crawling; distributed search engines; global network positioning; large-scale network distance measurement; load balancing; task scheduling; Computer science; Educational institutions; Fault tolerant systems; History; Information systems; Load management; Logic; Peer to peer computing; Routing; Scalability; GNP; network measurement; scheduling strategies
fLanguage
English
Publisher
IEEE
Conference_Titel
Web Information Systems and Mining, 2009. WISM 2009. International Conference on
Conference_Location
Shanghai
Print_ISBN
978-0-7695-3817-4
Type
conf
DOI
10.1109/WISM.2009.136
Filename
5369005