Title :
Data Extraction Based on Index Path in Web
Author :
Gao, Ya ; Yuan, Fang ; Zhang, Ming
Author_Institution :
Key Lab. in Machine Learning & Comput. Intelligence, Hebei Univ., Baoding, China
Abstract :
Web data extraction aims to obtain the information users want from Web pages. To extract valuable data more accurately, this paper proposes a new method called data extraction based on index path in Web (DEIP). The approach establishes an index path for each text node using the XML DOM, defines the prefix of the data-rich region by keywords in the index paths, and then generates extraction rules and obtains a wrapper accordingly. The wrapper can automatically extract data from pages of the same domain on a Web site. The method depends on the continuity, the structural similarity, and the location relations of the useful information in Web pages, not on HTML tags. Experiments indicate that the method performs well in terms of recall and precision of data extraction.
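The abstract does not define the index path or the extraction rule in detail, so the following is only a minimal sketch of the general idea, under stated assumptions. It assumes an index path is the tuple of child-element positions from the root down to the element holding a text node (tag names are ignored, in line with the claim that the method does not depend on HTML tags), and that the data-rich prefix is the longest common prefix of the index paths whose text contains a keyword. The function names `index_paths`, `data_rich_prefix`, and `extract`, and the sample page, are hypothetical and do not come from the paper.

```python
from xml.dom import minidom


def index_paths(xml_text):
    """Return (index_path, text) for every non-empty text node.

    Assumption: an index path is the tuple of element positions from the
    root element down to the parent of the text node.
    """
    doc = minidom.parseString(xml_text)
    found = []

    def walk(node, path):
        position = 0  # counts element children only, not text nodes
        for child in node.childNodes:
            if child.nodeType == child.ELEMENT_NODE:
                walk(child, path + (position,))
                position += 1
            elif child.nodeType == child.TEXT_NODE and child.data.strip():
                found.append((path, child.data.strip()))

    walk(doc.documentElement, (0,))
    return found


def data_rich_prefix(paths, keywords):
    """Longest common prefix of the index paths whose text holds a keyword."""
    hits = [p for p, text in paths if any(k in text for k in keywords)]
    if not hits:
        return ()
    prefix = hits[0]
    for p in hits[1:]:
        n = 0
        while n < min(len(prefix), len(p)) and prefix[n] == p[n]:
            n += 1
        prefix = prefix[:n]
    return prefix


def extract(paths, prefix):
    """Hypothetical extraction rule: keep text nodes under the prefix."""
    return [text for p, text in paths if p[:len(prefix)] == prefix]


# Hypothetical, well-formed sample page (minidom needs valid XML).
PAGE = """<html><body>
<div><p>Site menu</p></div>
<div><ul>
<li><b>Title</b><span>Book A</span></li>
<li><b>Title</b><span>Book B</span></li>
</ul></div>
</body></html>"""

paths = index_paths(PAGE)
prefix = data_rich_prefix(paths, ["Title"])
records = extract(paths, prefix)
```

In this sketch the navigation text falls outside the prefix and is dropped, while the two list items that share the prefix are kept. Real Web pages are rarely well-formed XML, so an actual wrapper would need an HTML-to-XML cleanup step first; that step is omitted here.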
Keywords :
Internet; XML; information retrieval; HTML tag; Web pages; Web site; XML DOM; data extraction; extraction rule; index path; structural similarity; Computer science; Data mining; Databases; HTML; Search engines; Web page design; Web search; DOM
Conference_Title :
Education Technology and Computer Science (ETCS), 2010 Second International Workshop on
Conference_Location :
Wuhan
Print_ISBN :
978-1-4244-6388-6
Electronic_ISBN :
978-1-4244-6389-3
DOI :
10.1109/ETCS.2010.291