• DocumentCode
    2028300
  • Title

    A novel approach for Web data extraction based on XML encoding

  • Author

    Nie, Tiezheng ; Shen, Derong ; Yu, Ge ; Shi, Zhong

  • Author_Institution
    Key Lab. of Med. Image Comput., Northeastern Univ., Shenyang, China
  • Volume
    5
  • fYear
    2010
  • fDate
    10-12 Aug. 2010
  • Firstpage
    2417
  • Lastpage
    2421
  • Abstract
    Extracting data from Web pages has been studied extensively. In this paper, we present a novel approach that extracts data records from Web pages using XML encoding techniques. First, our approach formats a given Web data page into an XML document. Then, instead of using DOM-based approaches, we use an XML encoding model to transform the XML document into a linear sequence. Our algorithm identifies the data records of a Web page from this sequence, which avoids the complex matching between subtrees in the DOM model. Moreover, we address the problem of repetitive subparts in records and propose an algorithm for data alignment. Experimental results show that our approach can extract data records accurately from Web pages.
  • Keywords
    Web sites; XML; encoding; DOM-based approach; Web data extraction; Web pages; XML document; XML encoding; data alignment; linear sequence; subtrees; Algorithm design and analysis; Data mining; Feature extraction; HTML; Web data; data extraction
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Fuzzy Systems and Knowledge Discovery (FSKD), 2010 Seventh International Conference on
  • Conference_Location
    Yantai, Shandong
  • Print_ISBN
    978-1-4244-5931-5
  • Type

    conf

  • DOI
    10.1109/FSKD.2010.5569297
  • Filename
    5569297