Comparison of Scheduling Algorithms for Domain Specific Web Crawler

Author

Filipowski, Krzysztof

Author_Institution

Dept. of Comput. Syst. & Networks, Wroclaw Univ. of Technol., Wroclaw, Poland

fYear

2014

fDate

29-30 Sept. 2014

Firstpage

69

Lastpage

74

Abstract

Domain-specific Web crawlers are effective tools for acquiring information from the Web. One of the most crucial factors influencing the efficiency of a domain crawler is the choice of crawling strategy. This article describes and compares several strategies for domain-specific Web crawling. It concentrates on scheduling algorithms, which determine the order in which the URLs collected by the crawler are visited. The objective of these strategies is to download the most relevant Web pages at an early stage of the crawl. The paper presents four different algorithms and compares them using several metrics.
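
As an illustration of what a scheduling algorithm of this kind does, the following is a minimal sketch of a best-first crawl frontier, assuming each URL already carries a relevance score. It is not taken from the paper: the class name `BestFirstFrontier`, the method `pop_best`, the example URLs, and the scores are all hypothetical, and the abstract does not say how the four compared algorithms score or order URLs. Popping one URL at a time corresponds to the generic Best-First Search idea; popping a batch of n corresponds to the generic Best N-First variant.

```python
import heapq
from itertools import count


class BestFirstFrontier:
    """Hypothetical crawl frontier: the highest-scored URL is crawled first."""

    def __init__(self):
        self._heap = []        # entries are (-score, insertion_order, url)
        self._seen = set()     # URLs already scheduled, to skip duplicates
        self._order = count()  # tie-breaker so equal scores stay first-in, first-out

    def add(self, url, score):
        """Schedule a URL with an assumed relevance score (higher is better)."""
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the best URL first.
        heapq.heappush(self._heap, (-score, next(self._order), url))

    def pop_best(self, n=1):
        """Return up to n URLs with the highest scores (n > 1 gives a batch)."""
        batch = []
        while self._heap and len(batch) < n:
            _, _, url = heapq.heappop(self._heap)
            batch.append(url)
        return batch

    def __len__(self):
        return len(self._heap)


frontier = BestFirstFrontier()
frontier.add("http://example.org/a", 0.2)
frontier.add("http://example.org/b", 0.9)
frontier.add("http://example.org/c", 0.5)
frontier.add("http://example.org/b", 0.1)  # duplicate URL, ignored

first = frontier.pop_best()     # single best URL
rest = frontier.pop_best(n=2)   # next two URLs as one batch
```

In a real crawler the score would come from some relevance estimate for the target domain, and newly discovered links would be added to the frontier after each download; both are left out here to keep the ordering logic visible.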

Keywords

Internet; Web sites; information retrieval; scheduling; Web pages; domain specific Web crawler; scheduling algorithms; Algorithm design and analysis; Crawlers; Search engines; Search problems; Uniform resource locators; Best N-First Search; Best-First Search; Domain Specific Crawling; Exploration

fLanguage

English

Publisher

IEEE

Conference_Titel

Network Intelligence Conference (ENIC), 2014 European

Conference_Location

Wroclaw

Type

conf

DOI

10.1109/ENIC.2014.14

Filename

6984893