DocumentCode
2120361
Title
A New Architecture of an Intelligent Agent-Based Crawler for Domain-Specific Deep Web Databases
Author
Yanni Li ; Yuping Wang ; Erfeng Tian
Author_Institution
Sch. of Comput. Sci. & Technol., Xidian Univ., Xi'an, China
Volume
1
fYear
2012
fDate
4-7 Dec. 2012
Firstpage
656
Lastpage
663
Abstract
A key problem in retrieving, integrating and mining rich, high-quality information from the massive number of online Deep Web Databases (WDBs) is how to automatically and effectively discover and recognize the entry points of domain-specific WDBs, i.e., searchable forms, on the Web. This is a challenging task because the forms of domain-specific WDBs are dynamic and heterogeneous, and are very sparsely distributed over several trillion Web pages. Although significant efforts have been made to address the problem and its special cases, more intelligent and effective solutions remain to be explored. In this paper, a new architecture of an intelligent agent-based crawler (iCrawler) for domain-specific Deep Web databases is proposed to address the limitations of existing methods. The iCrawler is based on intelligent learning agents and a domain ontology, and employs a series of novel strategies, including a two-step page classifier and a link scoring strategy, to improve on the performance of existing methods. Experiments with the iCrawler on a number of real Web pages in a set of representative domains show that it outperforms existing domain-specific Deep Web Form-Focused Crawlers (FFCs) in terms of harvest rate, coverage rate and time performance.
Keywords
Internet; data mining; database management systems; information retrieval; learning (artificial intelligence); multi-agent systems; pattern classification; FFC; Web pages; coverage rate; domain ontology; domain-specific WDB entry point discovery; domain-specific WDB entry point recognition; domain-specific deep Web database; domain-specific deep Web form-focused crawler; dynamic property; harvest rate; heterogeneous property; iCrawler; information integration; information mining; information retrieval; intelligent agent-based crawler; intelligent learning agents; link scoring strategy; searchable forms; time performance; two-step page classifier; Coverage Rate; Deep Web Databases (WDBs); Form-Focused Crawlers (FFCs); Harvest Rate;
fLanguage
English
Publisher
IEEE
Conference_Titel
Web Intelligence and Intelligent Agent Technology (WI-IAT), 2012 IEEE/WIC/ACM International Conferences on
Conference_Location
Macau
Print_ISBN
978-1-4673-6057-9
Type
conf
DOI
10.1109/WI-IAT.2012.103
Filename
6511958