Data Extraction Based on Index Path in Web

Author

Gao, Ya ; Yuan, Fang ; Zhang, Ming

Author_Institution

Key Lab. in Machine Learning & Comput. Intelligence, Hebei Univ., Baoding, China

Volume

3

fYear

2010

fDate

6-7 March 2010

Firstpage

157

Lastpage

160

Abstract

Web data extraction aims to obtain the information users want from Web pages. To extract valuable data more accurately, this paper proposes a new method called data extraction based on index path in Web (DEIP). The approach establishes an index path for each text node using the XML DOM, defines the prefix of the data-rich region from keywords in the index path, generates an extraction rule, and obtains a wrapper accordingly. The wrapper can automatically extract data from pages in the same domain of a Website. The method relies on the continuity, the structural similarity, and the location relations of the useful information in Web pages, not on the HTML tags. Experiments indicate that the method performs well in terms of recall and precision of data extraction.
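
The pipeline summarized in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's exact definitions, so everything below is an assumption for illustration: an index path is taken to be the sequence of (tag name, position among sibling elements) pairs from the document root to a text node, the data-rich prefix is taken to be the longest path prefix shared by the text nodes that contain a keyword, and the "wrapper" is simply that prefix. The function names, sample pages, and keywords are hypothetical, and the pages are assumed to be well-formed XML so that the standard-library DOM parser accepts them.

```python
from xml.dom import Node, minidom


def index_paths(node, path=()):
    """Yield (path, text) for every non-empty text node under `node`.

    A path is a tuple of (tag name, position among the parent's element
    children) pairs. This is an assumed definition, not the paper's.
    """
    element_pos = 0
    for child in node.childNodes:
        if child.nodeType == Node.TEXT_NODE:
            text = child.data.strip()
            if text:
                yield path, text
        elif child.nodeType == Node.ELEMENT_NODE:
            step = (child.tagName, element_pos)
            yield from index_paths(child, path + (step,))
            element_pos += 1


def common_prefix(paths):
    """Return the longest prefix shared by all of the given paths."""
    prefix = []
    for steps in zip(*paths):
        if len(set(steps)) != 1:
            break
        prefix.append(steps[0])
    return tuple(prefix)


def build_wrapper(page, keywords):
    """Learn a path prefix from the text nodes that contain a keyword."""
    doc = minidom.parseString(page)
    hits = [p for p, t in index_paths(doc) if any(k in t for k in keywords)]
    return common_prefix(hits)


def extract(page, prefix):
    """Return the text of every node whose index path starts with `prefix`."""
    doc = minidom.parseString(page)
    return [t for p, t in index_paths(doc) if p[:len(prefix)] == prefix]


# Hypothetical pages that share one template.
TRAIN_PAGE = (
    "<html><body><div><p>Home</p></div>"
    "<div><ul><li>Price: 10</li><li>Author: Gao</li></ul></div>"
    "</body></html>"
)
NEW_PAGE = (
    "<html><body><div><p>Home</p></div>"
    "<div><ul><li>Price: 25</li><li>Author: Yuan</li></ul></div>"
    "</body></html>"
)

wrapper = build_wrapper(TRAIN_PAGE, ["Price", "Author"])
records = extract(NEW_PAGE, wrapper)
```

In this sketch the learned prefix depends only on where the keyword-bearing text nodes sit in the tree, so the navigation text ("Home") is skipped on the second page even though no keyword is checked at extraction time.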

Keywords

Internet; XML; information retrieval; HTML tag; Web pages; Web site; XML DOM; data extraction; extraction rule; index path; structural similarity; Computer science; Data mining; Databases; HTML; Search engines; Web page design; Web search; DOM

fLanguage

English

Publisher

IEEE

Conference_Titel

Education Technology and Computer Science (ETCS), 2010 Second International Workshop on

Conference_Location

Wuhan

Print_ISBN

978-1-4244-6388-6

Electronic_ISBN

978-1-4244-6389-3

Type

conf

DOI

10.1109/ETCS.2010.291

Filename

5459747