Title :
The Research of Automatic Extraction Dynamic Web Data
Author_Institution :
Dept. of Comput. Sci. & Eng., Wuyi Univ., Wuyishan, China
Abstract :
The rapid development of the World Wide Web has made it an increasingly important source of useful data. A substantial fraction of the Web consists of pages that are dynamically generated from a common template populated with data from databases. This paper proposes a novel approach that automatically detects the template behind a set of example pages and extracts the embedded data at the field level. The template detection problem is formalized, and the underlying structure of template-generated pages is analyzed. A template detection approach is presented, and the detected templates are used to extract data from instance pages. Experimental results on two large third-party test beds show that the approach can achieve high extraction accuracy.
Keywords :
Internet; database management systems; World Wide Web; automatic extraction dynamic Web data; databases; embedded data extraction; template detection approach; template-generated pages; Application software; Computer science; Data engineering; Data mining; Databases; HTML; Information technology; Skeleton; Web pages; Web sites; automatic; dynamic; extraction; template; web;
Conference_Title :
Information Technology and Applications, 2009. IFITA '09. International Forum on
Conference_Location :
Chengdu
Print_ISBN :
978-0-7695-3600-2
DOI :
10.1109/IFITA.2009.211