DocumentCode :
2553785
Title :
A New Framework for Domain-Specific Hidden Web Crawling Based on Data Extraction Techniques
Author :
El-Desouky, Ali I. ; Ali, Hesham A. ; El-Ghamrawy, S.M.
Author_Institution :
Mansoura Univ., Mansoura
fYear :
2006
fDate :
10-12 Dec. 2006
Firstpage :
1
Lastpage :
1
Abstract :
The World Wide Web continues to grow at an exponential rate, which makes exploiting all useful information a standing challenge. Search engines such as Google crawl and index a large amount of information but ignore valuable data that represent 80% of the content on the Web. This portion of the Web, called the Hidden Web (HW), is "hidden" in databases behind search interfaces. In this paper, a framework for a HW crawler is proposed to crawl and extract hidden Web pages. Two unique features of our framework are 1) a classification phase that groups HW and Publicly Indexable Web (PIW) pages into distinct classes, so that our crawler performs well in both the domain-specific and random modes of crawling, and 2) the capability of dealing with both single-attribute and multi-attribute databases. Three novel algorithms are proposed in the framework: one for collecting Web pages, one for identifying relevant forms, and one for extracting labels. The effectiveness of the proposed algorithms is evaluated through experiments using real Web sites. The preliminary results are very promising. For instance, one of these algorithms proves to be accurate (over 99% precision and 100% recall).
Keywords :
Internet; Web sites; database management systems; information retrieval; pattern classification; search engines; Web sites; World Wide Web; data extraction technique; domain-specific hidden Web crawling; hidden Web page classification; publicly indexable Web pages; search engines; single/multiattribute databases; Crawlers; Data engineering; Data mining; HTML; Indexes; Search engines; Spatial databases; Systems engineering and theory; Web pages; Web sites; Crawling; HTML Forms; Hidden Web; Search Engines; Web Information Extraction;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Information & Communications Technology, 2006. ICICT '06. ITI 4th International Conference on
Conference_Location :
Cairo
Print_ISBN :
0-7803-9770-3
Type :
conf
DOI :
10.1109/ITICT.2006.358295
Filename :
4196519