DocumentCode :
2260654
Title :
Crawling Result Pages for Data Extraction Based on URL Classification
Author :
Nie, Tiezheng ; Wang, Zhenhua ; Kou, Yue ; Zhang, Rui
Author_Institution :
Key Lab. of Med. Image Comput., Northeastern Univ., Shenyang, China
fYear :
2010
fDate :
20-22 Aug. 2010
Firstpage :
79
Lastpage :
84
Abstract :
In Web database integration, crawling data pages is an important step for data extraction. Because data are spread across multiple result pages, accessing them for integration is more difficult. It is therefore necessary to crawl query result pages from a Web database accurately and automatically. To address this problem, we propose a novel approach based on URL classification to identify result pages effectively. In our approach, we compute the similarity between the URLs of hyperlinks in result pages and classify them into four categories. Each category maps to a set of similar Web pages, which separates result pages from other pages. We then use a page probing method to verify the correctness of the classification and improve the accuracy of the crawled result pages. The experimental results demonstrate that our approach is effective for identifying the collection of result pages in a Web database and can improve the quality and efficiency of data extraction.
Keywords :
Web sites; information retrieval; online front-ends; URL classification; Web database; Web pages; category map; crawling data pages; crawling result pages; data extraction; hyperlinks; Accuracy; Classification algorithms; Clustering algorithms; Data mining; Databases; URL; classification; component; result pages
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Web Information Systems and Applications Conference (WISA), 2010 7th
Conference_Location :
Hohhot
Print_ISBN :
978-1-4244-8440-9
Type :
conf
DOI :
10.1109/WISA.2010.14
Filename :
5581367