• DocumentCode
    3245421
  • Title
    A Forwarding-Based Task Scheduling Algorithm for Distributed Web Crawling over DHTs
  • Author
    Xu, Xiao ; Zhang, Wei-Zhe ; Zhang, Hong-Li ; Fang, Bin-Xing ; Liu, Xin-Ran
  • Author_Institution
    Sch. of Comput. Sci. & Technol., Harbin Inst. of Technol., Harbin, China
  • fYear
    2009
  • fDate
    8-11 Dec. 2009
  • Firstpage
    854
  • Lastpage
    859
  • Abstract
    Distributed Web crawling (DWC) over DHTs has been proposed to remove the bottlenecks of traditional Web crawling. The core of such a system is its fully distributed task scheduling mechanism, in which the crawlers are treated as peers and the crawlees are treated as resources maintained by the peers. A system model based on the content addressable network (CAN) can further optimize the scheduling mechanism by exploiting the network proximity of the crawlers and the crawlees. In this paper, we propose a new forwarding-based task scheduling method for CAN that achieves load balancing in the CAN-based DWC system. In our simulations, the method not only balances the load among peers but also keeps the distance between peers and resources very short. This short peer-resource distance in turn serves the goal of reducing crawler-crawlee latencies.
  • Keywords
    Internet; cryptography; resource allocation; scheduling; content addressable network; crawler-crawlee latencies; distributed Web crawling; distributed hash tables; forwarding-based task scheduling algorithm; load balancing; network proximity; Computer networks; Computer science; Crawlers; Delay; Load management; Peer to peer computing; Processor scheduling; Robustness; Scalability; Scheduling algorithm; Content Addressable Network; DHT; distributed Web crawling; task scheduling;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Systems (ICPADS), 2009 15th International Conference on
  • Conference_Location
    Shenzhen
  • ISSN
    1521-9097
  • Print_ISBN
    978-1-4244-5788-5
  • Type
    conf
  • DOI
    10.1109/ICPADS.2009.29
  • Filename
    5395331