DocumentCode :
2277639
Title :
An Approach to Deep Web Crawling by Sampling
Author :
Lu, Jianguo ; Wang, Yan ; Liang, Jie ; Chen, Jessica ; Liu, Jiming
Author_Institution :
Sch. of Comput. Sci., Univ. of Windsor, Windsor, ON
Volume :
1
fYear :
2008
fDate :
9-12 Dec. 2008
Firstpage :
718
Lastpage :
724
Abstract :
Crawling the deep web is the process of collecting data from search interfaces by issuing queries. With the wide availability of programmable interfaces encoded in Web services, deep web crawling has found a wide variety of applications. One of the major challenges in crawling the deep web is selecting queries so that most of the data can be retrieved at low cost. We propose a general method to this end. To minimize the number of duplicates retrieved, we reduce the problem of selecting an optimal set of queries from a sample of the data source to the well-known set-covering problem and adopt a classical algorithm to solve it. To verify that the queries selected from a sample also produce good results for the entire data source, we carried out a set of experiments on large corpora including Wikipedia and Reuters. We show empirically that our sampling-based method is effective: 1) the queries selected from samples can harvest most of the data in the original database; 2) queries with a low overlapping rate in the samples also have a low overlapping rate in the original database; and 3) neither the sample nor the pool of terms from which the queries are selected needs to be very large.
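The reduction to set covering described in the abstract can be illustrated with a minimal sketch. The abstract does not name the classical algorithm used, so the greedy heuristic below is an assumption, as are the function names `select_queries` and `overlapping_rate`, the toy `sample` data, and the definition of overlapping rate as total documents retrieved divided by unique documents retrieved.

```python
from typing import Dict, List, Set


def select_queries(term_to_docs: Dict[str, Set[int]],
                   target_coverage: float = 1.0) -> List[str]:
    """Greedy set-cover heuristic (assumed, not taken from the paper).

    term_to_docs maps each candidate query term to the set of sample
    document ids it matches. Terms are picked one at a time, always
    taking the term that adds the most not-yet-covered documents.
    """
    universe: Set[int] = set().union(*term_to_docs.values())
    needed = target_coverage * len(universe)
    covered: Set[int] = set()
    selected: List[str] = []
    while len(covered) < needed:
        # Sorting first makes tie-breaking deterministic (alphabetical).
        best = max(sorted(term_to_docs),
                   key=lambda t: len(term_to_docs[t] - covered))
        gain = term_to_docs[best] - covered
        if not gain:  # no remaining term adds anything new
            break
        selected.append(best)
        covered |= gain
    return selected


def overlapping_rate(selected: List[str],
                     term_to_docs: Dict[str, Set[int]]) -> float:
    """Total documents retrieved divided by unique documents retrieved
    (an assumed definition; 1.0 means no duplicates)."""
    total = sum(len(term_to_docs[t]) for t in selected)
    unique = len(set().union(*(term_to_docs[t] for t in selected)))
    return total / unique if unique else 0.0


# Hypothetical sample: four terms over six sampled documents.
sample = {
    "web": {1, 2, 3},
    "data": {3, 4},
    "query": {4, 5, 6},
    "deep": {1, 2},
}
queries = select_queries(sample)
print(queries, overlapping_rate(queries, sample))
```

In this toy sample two terms cover all six documents without retrieving any document twice, whereas a pair such as "web" and "data" would return document 3 twice. Lowering `target_coverage` trades completeness for fewer queries.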
Keywords :
Web services; Web sites; query processing; user interfaces; Reuters; Wikipedia; data source; deep Web crawling; sampling-based method; search interfaces; Computer science; Costs; Data mining; Databases; Information retrieval; Intelligent agent; Sampling methods; Telecommunication traffic; Uniform resource locators; deep web
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Web Intelligence and Intelligent Agent Technology, 2008. WI-IAT '08. IEEE/WIC/ACM International Conference on
Conference_Location :
Sydney, NSW
Print_ISBN :
978-0-7695-3496-1
Type :
conf
DOI :
10.1109/WIIAT.2008.392
Filename :
4740535