DocumentCode :
501186
Title :
The Research of Automatic Extraction Dynamic Web Data
Author :
Jubao, Qu
Author_Institution :
Dept. of Comput. Sci. & Eng., Wuyi Univ., Wuyishan, China
Volume :
2
fYear :
2009
fDate :
15-17 May 2009
Firstpage :
143
Lastpage :
146
Abstract :
The rapid development of the World Wide Web has made it an increasingly important source of useful data. A substantial fraction of the Web consists of pages that are dynamically generated from a common template populated with data from databases. This paper proposes a novel approach to automatically detecting templates from a set of example pages and extracting data at the field level. The objective of the research is to automatically detect the template behind these pages and extract the embedded data. The template detection problem is formalized, and the underlying structure of template-generated pages is analyzed. A template detection approach is presented, and the detected templates are used to extract data from instance pages. Experimental results on two large third-party test beds show that the approach achieves high extraction accuracy.
Keywords :
Internet; database management systems; World Wide Web; automatic extraction dynamic Web data; databases; embedded data extraction; template detection approach; template-generated pages; Application software; Computer science; Data engineering; Data mining; Databases; HTML; Information technology; Skeleton; Web pages; Web sites; automatic; dynamic; extraction; template; web;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Information Technology and Applications, 2009. IFITA '09. International Forum on
Conference_Location :
Chengdu
Print_ISBN :
978-0-7695-3600-2
Type :
conf
DOI :
10.1109/IFITA.2009.211
Filename :
5231252