DocumentCode
2028300
Title
A novel approach for Web data extraction based on XML encoding
Author
Nie, Tiezheng ; Shen, Derong ; Yu, Ge ; Shi, Zhong
Author_Institution
Key Lab. of Med. Image Comput., Northeastern Univ., Shenyang, China
Volume
5
fYear
2010
fDate
10-12 Aug. 2010
Firstpage
2417
Lastpage
2421
Abstract
The problem of extracting data from Web pages has been studied in many works. In this paper, we present a novel approach that extracts data records from Web pages using XML encoding techniques. First, our approach converts a given Web data page into an XML document. Then, instead of using DOM-based approaches, we use an XML encoding model to transform the XML document into a linear sequence. Our algorithm identifies the data records of a Web page from this sequence, which avoids the complex matching between subtrees in the DOM model. Moreover, we address the problem of repetitive subparts in records and propose an algorithm for data alignment. Experimental results show that our approach can extract data records accurately from Web pages.
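The pipeline outlined in the abstract (page to XML, XML to linear sequence, records found in the sequence) can be illustrated with a minimal Python sketch. The abstract does not specify the paper's encoding model or record-detection algorithm, so everything here is an assumption made for illustration: the preorder (depth, tag) encoding, the shallowest-repeated-pair heuristic, the helper names `encode` and `split_records`, and the sample page are all hypothetical and are not the authors' method. The input is assumed to be already well-formed XML.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample page, already well-formed XML (an assumption).
PAGE = """
<html><body>
  <h1>Results</h1>
  <ul>
    <li><b>Item A</b><i>10</i></li>
    <li><b>Item B</b><i>20</i></li>
    <li><b>Item C</b><i>30</i></li>
  </ul>
</body></html>
"""

def encode(root):
    """Flatten the tree into a preorder list of (depth, tag) pairs.

    This is a stand-in encoding, not the model used in the paper.
    """
    sequence = []

    def visit(node, depth):
        sequence.append((depth, node.tag))
        for child in node:
            visit(child, depth + 1)

    visit(root, 0)
    return sequence

def split_records(sequence):
    """Cut the sequence into candidate records without comparing subtrees.

    Heuristic (an assumption): the shallowest (depth, tag) pair that
    occurs more than once marks the start of each record.
    """
    counts = Counter(sequence)
    repeated = [pair for pair, n in counts.items() if n > 1]
    if not repeated:
        return []
    marker = min(repeated)  # smallest depth first, ties broken by tag name
    records, current = [], None
    for pair in sequence:
        if pair == marker:
            if current is not None:
                records.append(current)
            current = [pair]
        elif current is not None:
            if pair[0] > marker[0]:
                current.append(pair)     # still inside the current record
            else:
                records.append(current)  # left the record's subtree
                current = None
    if current is not None:
        records.append(current)
    return records

root = ET.fromstring(PAGE)
sequence = encode(root)
records = split_records(sequence)
for record in records:
    print(record)
```

Because the records are cut directly out of the flat sequence, no pairwise subtree comparison is needed in this sketch. It does not attempt the alignment of repetitive subparts that the abstract also mentions.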
Keywords
Web sites; XML; encoding; DOM-based approach; Web data extraction; Web pages; XML document; XML encoding; data alignment; linear sequence; subtrees; Algorithm design and analysis; Data mining; Feature extraction; HTML; Web data; data extraction
fLanguage
English
Publisher
IEEE
Conference_Titel
Fuzzy Systems and Knowledge Discovery (FSKD), 2010 Seventh International Conference on
Conference_Location
Yantai, Shandong
Print_ISBN
978-1-4244-5931-5
Type
conf
DOI
10.1109/FSKD.2010.5569297
Filename
5569297