Regional Information Center for Science and Technology (RICEST) - Crawling for domain-specific hidden Web resources

DocumentCode :

2418395

Title :

Crawling for domain-specific hidden Web resources

Author :

Bergholz, André ; Chidlovskii, B.

Author_Institution :

Xerox Res. Center Eur., Meylan, France

fYear :

2003

fDate :

10-12 Dec. 2003

Firstpage :

125

Lastpage :

133

Abstract :

The Hidden Web, the part of the Web that remains unavailable to standard crawlers, has become an important research topic in recent years. Its size is estimated to be 400 to 500 times larger than that of the publicly indexable Web (PIW). Furthermore, the information on the Hidden Web is assumed to be more structured, because it is usually stored in databases. In this paper, we describe a crawler that, starting from the PIW, finds entry points into the Hidden Web. The crawler is domain-specific and is initialized with pre-classified documents and relevant keywords. We describe our approach to the automatic identification of Hidden Web resources among encountered HTML forms. We conduct a series of experiments using the top-level categories in the Google directory and report our analysis of the discovered Hidden Web resources.
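
To make the idea of picking out Hidden Web entry points among HTML forms concrete, here is a minimal Python sketch. It is not the authors' method, which the abstract does not detail. It assumes a deliberately simple heuristic: a page qualifies if it mentions at least one domain keyword, and a form qualifies if it has a free-text or selection field and no password field. The names `FormCollector` and `candidate_forms` are hypothetical.

```python
from html.parser import HTMLParser


class FormCollector(HTMLParser):
    """Collects, for each <form>, its action and the types of its controls."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one dict per completed form
        self._current = None  # form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {"action": attrs.get("action") or "", "controls": []}
        elif self._current is not None:
            if tag == "input":
                # An <input> without a type attribute defaults to a text field.
                self._current["controls"].append((attrs.get("type") or "text").lower())
            elif tag in ("select", "textarea"):
                self._current["controls"].append(tag)

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def candidate_forms(html, keywords):
    """Return forms that look like query interfaces (assumed heuristic).

    The page must mention at least one domain keyword; a form must offer
    a text, search, or select control and must not ask for a password.
    """
    text = html.lower()
    if not any(k.lower() in text for k in keywords):
        return []
    parser = FormCollector()
    parser.feed(html)
    parser.close()
    result = []
    for form in parser.forms:
        controls = form["controls"]
        has_query_field = any(c in ("text", "search", "select") for c in controls)
        asks_for_password = "password" in controls
        if has_query_field and not asks_for_password:
            result.append(form)
    return result
```

A crawler along these lines would call `candidate_forms` on each fetched page with the keywords of its domain and keep the returned forms as possible entry points. A real system would need stronger evidence, for example submitting probe queries and inspecting the results.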

Keywords :

Internet; hypermedia markup languages; information resources; information retrieval; online front-ends; search engines; Google directory; HTML forms; Hidden Web resources; Web crawlers; Web pages; automatic identification; information searching; pre-classified documents; publicly indexable Web; relevant keywords; Computer science; Crawlers; Databases; Europe; HTML; Humans; Information analysis

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Web Information Systems Engineering, 2003. WISE 2003. Proceedings of the Fourth International Conference on

Print_ISBN :

0-7695-1999-7

Type :

conf

DOI :

10.1109/WISE.2003.1254476

Filename :

1254476

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2418395