Regional Information Center for Science and Technology - Crawling Result Pages for Data Extraction Based on URL Classification

DocumentCode :

2260654

Title :

Crawling Result Pages for Data Extraction Based on URL Classification

Author :

Nie, Tiezheng ; Wang, Zhenhua ; Kou, Yue ; Zhang, Rui

Author_Institution :

Key Lab. of Med. Image Comput., Northeastern Univ., Shenyang, China

fYear :

2010

fDate :

20-22 Aug. 2010

Firstpage :

Lastpage :

Abstract :

In Web database integration, crawling data pages is an important step for data extraction. Because the data are spread across multiple result pages, accessing them for integration is harder, so query result pages must be crawled from Web databases accurately and automatically. To address this problem, we propose a novel approach based on URL classification to identify result pages effectively. In our approach, we compute the similarity between the URLs of hyperlinks in result pages and classify them into four categories. Each category maps to a set of similar web pages, which separates result pages from other pages. We then use a page probing method to verify the correctness of the classification and improve the accuracy of the crawled result pages. The experimental results demonstrate that our approach is effective for identifying the collection of result pages in a Web database and can improve the quality and efficiency of data extraction.
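
The abstract does not say how URL similarity is computed, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a Jaccard similarity over structural tokens (host, path segments, query parameter names) and a simple greedy grouping; the function names, the `example.com` URLs, and the 0.8 threshold are all hypothetical.

```python
from urllib.parse import urlsplit, parse_qsl


def url_features(url):
    """Return structural tokens of a URL: host, path segments, query parameter names."""
    parts = urlsplit(url)
    feats = {"host:" + parts.netloc.lower()}
    feats.update("path:" + seg for seg in parts.path.split("/") if seg)
    # Parameter values (e.g. the page number) are ignored on purpose,
    # so that successive result pages look alike.
    feats.update(
        "param:" + name
        for name, _ in parse_qsl(parts.query, keep_blank_values=True)
    )
    return feats


def url_similarity(a, b):
    """Jaccard similarity of the two URLs' token sets (an assumed measure)."""
    fa, fb = url_features(a), url_features(b)
    return len(fa & fb) / len(fa | fb)


def group_urls(urls, threshold=0.8):
    """Greedily put each URL into the first group whose representative is similar enough."""
    groups = []
    for url in urls:
        for group in groups:
            if url_similarity(url, group[0]) >= threshold:
                group.append(url)
                break
        else:
            groups.append([url])
    return groups


if __name__ == "__main__":
    links = [
        "http://example.com/search?q=db&page=2",
        "http://example.com/search?q=db&page=3",
        "http://example.com/item/42",
        "http://example.com/about",
    ]
    for group in group_urls(links):
        print(group)
```

With these hypothetical links, the two paginated search URLs share every structural token and fall into one group, while the detail and static pages stay separate. A verification step such as the page probing mentioned in the abstract would still be needed to confirm which group actually holds result pages.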

Keywords :

Web sites; information retrieval; online front-ends; URL classification; Web database; Web pages; category map; crawling data pages; crawling result pages; data extraction; hyperlinks; Accuracy; Classification algorithms; Clustering algorithms; Data mining; Databases; URL; classification; component; result pages

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Web Information Systems and Applications Conference (WISA), 2010 7th

Conference_Location :

Hohhot

Print_ISBN :

978-1-4244-8440-9

Type :

conf

DOI :

10.1109/WISA.2010.14

Filename :

5581367

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2260654