Automatic Data Records Extraction from List Page in Deep Web Sources

Author

Hong-ping, Chen ; Wei, Fang ; Zhou, Yang ; Lin, Zhuo ; Zhi-Ming, Cui

Author_Institution

Inst. of Intell. Inf. Process. & Applic., Soochow Univ., Suzhou, China

Volume

1

fYear

2009

fDate

18-19 July 2009

Firstpage

370

Lastpage

373

Abstract

With the explosive growth and popularity of the World Wide Web, a wealth of online e-commerce information resources has become available. List pages on these Web sites are usually generated automatically from a back-end DBMS using scripts. To provide value-added services and convenience for users, it is necessary to integrate Web sources of the same domain. Given the huge number of such pages, extracting data records from them manually is difficult, if not impossible, on a large scale. Based on the characteristics of template-based list pages, this paper proposes the layout-based data region finding (LBDRF) algorithm to automatically extract data records from Web pages in the deep Web. Experimental results show that the proposed method performs well.

Keywords

Web services; Web sites; data mining; database management systems; electronic commerce; hypermedia markup languages; information retrieval; search engines; DOM tree model; LBDRF; World Wide Web; back-end DBMS; data record extraction; deep Web source; document object model; layout-based data region finding; list page; online e-commerce information resource; value-added service; Books; Clustering algorithms; Explosives; Information processing; Information resources; Large-scale systems; Web pages; Data Extraction; Data record; Deep Web

fLanguage

English

Publisher

IEEE

Conference_Titel

Information Processing, 2009. APCIP 2009. Asia-Pacific Conference on

Conference_Location

Shenzhen

Print_ISBN

978-0-7695-3699-6

Type

conf

DOI

10.1109/APCIP.2009.100

Filename

5197073