DocumentCode
2553785
Title
A New Framework for Domain-Specific Hidden Web Crawling Based on Data Extraction Techniques
Author
El-Desouky, Ali I. ; Ali, Hesham A. ; El-Ghamrawy, S.M.
Author_Institution
Mansoura Univ., Mansoura
fYear
2006
fDate
10-12 Dec. 2006
Firstpage
1
Lastpage
1
Abstract
The World Wide Web continues to grow at an exponential rate, which makes exploiting all of its useful information a standing challenge. Search engines such as Google crawl and index a large amount of information but ignore valuable data that represents 80% of the content on the Web. This portion of the Web, called the Hidden Web (HW), is "hidden" in databases behind search interfaces. In this paper, a framework for a HW crawler is proposed to crawl and extract hidden Web pages. Two unique features of our framework are 1) a classification phase that groups HW and Publicly Indexable Web (PIW) pages into distinct classes, so that our crawler performs well in both the domain-specific and random modes of crawling, and 2) the capability of dealing with both single-attribute and multi-attribute databases. Three novel algorithms are proposed in the framework: one for collecting Web pages, one for identifying relevant forms, and one for extracting labels. The effectiveness of the proposed algorithms is evaluated through experiments using real Web sites. The preliminary results are very promising. For instance, one of these algorithms proves to be accurate (over 99% precision and 100% recall).
Keywords
Internet; Web sites; database management systems; information retrieval; pattern classification; search engines; World Wide Web; data extraction technique; domain-specific hidden Web crawling; hidden Web page classification; publicly indexable Web pages; single/multiattribute databases; Crawlers; Data engineering; Data mining; HTML; Indexes; Spatial databases; Systems engineering and theory; Web pages; Crawling; HTML Forms; Hidden Web; Web Information Extraction
fLanguage
English
Publisher
IEEE
Conference_Titel
Information & Communications Technology, 2006. ICICT '06. ITI 4th International Conference on
Conference_Location
Cairo
Print_ISBN
0-7803-9770-3
Type
conf
DOI
10.1109/ITICT.2006.358295
Filename
4196519