DocumentCode :
2120361
Title :
A New Architecture of an Intelligent Agent-Based Crawler for Domain-Specific Deep Web Databases
Author :
Yanni Li ; Yuping Wang ; Erfeng Tian
Author_Institution :
Sch. of Comput. Sci. & Technol., Xidian Univ., Xi'an, China
Volume :
1
fYear :
2012
fDate :
4-7 Dec. 2012
Firstpage :
656
Lastpage :
663
Abstract :
A key problem in retrieving, integrating and mining rich, high-quality information from massive online Deep Web Databases (WDBs) is how to automatically and effectively discover and recognize the entry points of domain-specific WDBs, i.e., searchable forms, on the Web. This is a challenging task because the forms of domain-specific WDBs are dynamic and heterogeneous and are very sparsely distributed over several trillion Web pages. Although significant efforts have been made to address the problem and its special cases, more intelligent and effective solutions remain to be explored. In this paper, a new architecture of an intelligent agent-based crawler (iCrawler) for domain-specific Deep Web databases is proposed to address the limitations of existing methods. The iCrawler is based on intelligent learning agents and a domain ontology, and uses a series of novel strategies, including a two-step page classifier and a link scoring strategy, among others, to improve on the performance of existing methods. Experiments with the iCrawler over a number of real Web pages in a set of representative domains show that it outperforms existing domain-specific Deep Web Form-Focused Crawlers (FFCs) in terms of harvest rate, coverage rate and time performance.
Keywords :
Internet; data mining; database management systems; information retrieval; learning (artificial intelligence); multi-agent systems; pattern classification; FFC; Web pages; coverage rate; domain ontology; domain-specific WDB entry point discovery; domain-specific WDB entry point recognition; domain-specific deep Web database; domain-specific deep Web form-focused crawler; dynamic property; harvest rate; heterogeneous property; iCrawler; information integration; information mining; information retrieval; intelligent agent-based crawler; intelligent learning agents; link scoring strategy; searchable forms; time performance; two-step page classifier; Coverage Rate; Deep Web Databases (WDBs); Form-Focused Crawlers (FFCs); Harvest Rate;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Web Intelligence and Intelligent Agent Technology (WI-IAT), 2012 IEEE/WIC/ACM International Conferences on
Conference_Location :
Macau
Print_ISBN :
978-1-4673-6057-9
Type :
conf
DOI :
10.1109/WI-IAT.2012.103
Filename :
6511958