• DocumentCode
    496878
  • Title

    Automatic Data Records Extraction from List Page in Deep Web Sources

  • Author

    Hong-ping, Chen ; Wei, Fang ; Zhou, Yang ; Lin, Zhuo ; Zhi-Ming, Cui

  • Author_Institution
    Inst. of Intell. Inf. Process. & Applic., Soochow Univ., Suzhou, China
  • Volume
    1
  • fYear
    2009
  • fDate
    18-19 July 2009
  • Firstpage
    370
  • Lastpage
    373
  • Abstract
    With the explosive growth and popularity of the World Wide Web, a wealth of online e-commerce information resources has become available. List pages on these Web sites are usually generated automatically from a back-end DBMS using scripts. To provide value-added services and convenience for users, it is necessary to integrate Web sources of the same domain. Given the huge number of these Web pages, it is difficult or even impossible to extract data records from these list pages manually on a large scale. Based on the characteristics of template-based list pages, this paper proposes the LBDRF (layout-based data region finding) algorithm to solve the problem of automatic data record extraction from Web pages in the deep Web. Our experimental results show that the proposed method performs well.
  • Keywords
    Web services; Web sites; data mining; database management systems; electronic commerce; hypermedia markup languages; information retrieval; search engines; DOM tree model; LBDRF; Web sites; World Wide Web; back-end DBMS; data mining; data record extraction; deep Web source; document object model; layout-based data region finding; list page; online e-commerce information resource; value-added service; Books; Clustering algorithms; Data mining; Explosives; Information processing; Information resources; Large-scale systems; Search engines; Web pages; Web sites; Data Extraction; Data record; Deep Web;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Information Processing, 2009. APCIP 2009. Asia-Pacific Conference on
  • Conference_Location
    Shenzhen
  • Print_ISBN
    978-0-7695-3699-6
  • Type

    conf

  • DOI
    10.1109/APCIP.2009.100
  • Filename
    5197073