DocumentCode
2298294
Title
Data Extraction Based on Index Path in Web
Author
Gao, Ya ; Yuan, Fang ; Zhang, Ming
Author_Institution
Key Lab. in Machine Learning & Comput. Intelligenc, Hebei Univ., Baoding, China
Volume
3
fYear
2010
fDate
6-7 March 2010
Firstpage
157
Lastpage
160
Abstract
Web data extraction aims to obtain the information users want from Web pages. To extract valuable data more accurately, this paper proposes a new method called data extraction based on index path in Web (DEIP). The approach establishes an index path for each text node using the XML DOM, defines the data-rich prefix by keywords in the index path, generates an extraction rule, and obtains a wrapper accordingly. The wrapper can automatically extract data from pages of the same domain within a Web site. The method relies on the continuity, structural similarity, and location relations of the useful information in Web pages rather than on the HTML tags. Experiments indicate that the method achieves good recall and precision in data extraction.
Keywords
Internet; XML; information retrieval; HTML tag; Web pages; Web site; XML DOM; data extraction; extraction rule; index path; structural similarity; Computer science; Data mining; Databases; HTML; Search engines; Web page design; Web search; DOM
fLanguage
English
Publisher
IEEE
Conference_Titel
Education Technology and Computer Science (ETCS), 2010 Second International Workshop on
Conference_Location
Wuhan
Print_ISBN
978-1-4244-6388-6
Electronic_ISBN
978-1-4244-6389-3
Type
conf
DOI
10.1109/ETCS.2010.291
Filename
5459747