Regional Information Center for Science and Technology (RICeST)

DocumentCode :

719141

Title :

The anatomy of web crawlers

Author :

Sharma, Shruti ; Gupta, Parul

Author_Institution :

Dept. of Comput. Eng., YMCA Univ. of Sci. & Technol., Faridabad, India

fYear :

2015

fDate :

15-16 May 2015

Firstpage :

849

Lastpage :

853

Abstract :

The World Wide Web (WWW) is a gigantic and rich source of information. To retrieve information from this imperative resource, search engines are generally used. For this purpose, search engines rely on massive collections of web pages that have been downloaded by web crawlers. A web crawler is a program that traverses the web by following its ever-changing, dense, and distributed hyperlinked structure and stores the downloaded pages in a large repository, which is later indexed for efficient execution of user queries. Thus, web crawlers are becoming increasingly important. Various web crawling architectures have been proposed in recent years. This paper surveys different web crawler architectures and compares them on important features such as scalability, manageability, page refresh policy, and politeness policy.

Keywords :

Internet; Web sites; query processing; search engines; WWW; Web crawler anatomy; Web crawling architectures; Web pages; World Wide Web; distributed hyperlinked structure; imperative resource; page refresh policy; politeness policy; user queries; Automation; Computer architecture; Crawlers; Uniform resource locators; Focused Crawler; Hidden Web Crawler; Parallel Crawler; Web crawler

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Computing, Communication & Automation (ICCCA), 2015 International Conference on

Conference_Location :

Noida

Print_ISBN :

978-1-4799-8889-1

Type :

conf

DOI :

10.1109/CCAA.2015.7148493

Filename :

7148493

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=719141