• DocumentCode
    2909926
  • Title
    A GNP-Based Scheduling Strategy for Distributed Crawling
  • Author
    Liu, Shuang ; Xu, Xiao ; Li, Dong ; Zhang, Wei-Zhe ; Liu, Xin-Ran
  • Author_Institution
    Dept. of Comput. Sci. & Technol., Harbin Inst. of Technol., Harbin, China
  • fYear
    2009
  • fDate
    7-8 Nov. 2009
  • Firstpage
    651
  • Lastpage
    655
  • Abstract
    To address the task scheduling and load balancing problems of distributed search engines, this paper proposes a GNP-based scheduling strategy for distributed crawling together with a load balancing method. An Internet distance estimation mechanism is adopted in place of large-scale network distance measurement, which both improves the system's response speed and reduces the load the system places on the WAN. By deploying crawling nodes across WANs, we built a distributed search engine and implemented several scheduling strategies. The online experiment shows a great improvement in the system's performance.
  • Keywords
    Internet; resource allocation; scheduling; search engines; GNP-based scheduling strategy; Internet distance estimating mechanism; WAN; distributed crawling; distributed search engines; global network positioning; large-scale network distance measurement; load balancing; task scheduling; Computer science; Educational institutions; Fault tolerant systems; History; Information systems; Load management; Logic; Peer to peer computing; Routing; Scalability; GNP; network measurement; scheduling strategies
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Web Information Systems and Mining, 2009. WISM 2009. International Conference on
  • Conference_Location
    Shanghai
  • Print_ISBN
    978-0-7695-3817-4
  • Type
    conf
  • DOI
    10.1109/WISM.2009.136
  • Filename
    5369005