Title :
Automatic Data Records Extraction from List Page in Deep Web Sources
Author :
Hong-ping, Chen ; Wei, Fang ; Zhou, Yang ; Lin, Zhuo ; Zhi-Ming, Cui
Author_Institution :
Inst. of Intell. Inf. Process. & Applic., Soochow Univ., Suzhou, China
Abstract :
With the explosive growth and popularity of the World Wide Web, a wealth of online e-commerce information resources has become available. List pages on these Web sites are usually generated automatically from a back-end DBMS using scripts. To provide value-added services and convenience for users, it is necessary to integrate Web sources of the same domain. Given the huge number of such Web pages, extracting data records from list pages manually is difficult, and at large scale effectively impossible. Based on the characteristics of template-based list pages, this paper proposes LBDRF, a layout-based data region finding algorithm, to automatically extract data records from list pages in the deep Web. Experimental results show that the proposed method performs well.
Keywords :
Web services; Web sites; data mining; database management systems; electronic commerce; hypermedia markup languages; information retrieval; search engines; DOM tree model; LBDRF; World Wide Web; back-end DBMS; data record extraction; deep Web source; document object model; layout-based data region finding; list page; online e-commerce information resource; value-added service; Books; Clustering algorithms; Explosives; Information processing; Information resources; Large-scale systems; Web pages; Data Extraction; Data record; Deep Web
Conference_Title :
Information Processing, 2009. APCIP 2009. Asia-Pacific Conference on
Conference_Location :
Shenzhen
Print_ISBN :
978-0-7695-3699-6
DOI :
10.1109/APCIP.2009.100