The Research of Automatic Extraction Dynamic Web Data

Author

Jubao, Qu

Author_Institution

Dept. of Comput. Sci. & Eng., Wuyi Univ., Wuyishan, China

Volume

2

fYear

2009

fDate

15-17 May 2009

Firstpage

143

Lastpage

146

Abstract

The rapid growth of the World Wide Web has made it an increasingly important source of useful data. A substantial fraction of the Web consists of pages that are dynamically generated from a common template populated with data from databases. This paper proposes a novel approach to automatically detecting templates from a set of example pages and extracting data at the field level. The objective is to automatically detect the template behind these pages and extract the embedded data. The template detection problem is formalized, and the underlying structure of template-generated pages is analyzed. A template detection approach is presented, and the detected templates are used to extract data from instance pages. Experimental results on two large third-party test beds show that the approach achieves high extraction accuracy.
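
The abstract does not spell out the detection algorithm, so the following is only a minimal sketch of the general idea of template-based extraction, not the paper's method. It assumes that tokens shared, in order, by every example page belong to the template and that whatever falls in the gaps is field-level data. The function names (`tokenize`, `detect_template`, `extract_fields`) and the sample pages are hypothetical, and the alignment uses Python's standard `difflib.SequenceMatcher` as a stand-in for whatever structural analysis the paper performs.

```python
import re
from difflib import SequenceMatcher

# A token is either a whole tag or a run of text between tags.
TOKEN_RE = re.compile(r"<[^>]+>|[^<]+")


def tokenize(html):
    """Split a page into tag tokens and whitespace-stripped text tokens."""
    return [t.strip() for t in TOKEN_RE.findall(html) if t.strip()]


def detect_template(pages):
    """Assumption: the template is the token sequence common to all example pages."""
    template = tokenize(pages[0])
    for page in pages[1:]:
        tokens = tokenize(page)
        matcher = SequenceMatcher(None, template, tokens, autojunk=False)
        # Keep only the tokens that also occur, in order, in this page.
        template = [
            tok
            for block in matcher.get_matching_blocks()
            for tok in template[block.a:block.a + block.size]
        ]
    return template


def extract_fields(template, page):
    """Return the page tokens that are not covered by the template."""
    tokens = tokenize(page)
    matcher = SequenceMatcher(None, template, tokens, autojunk=False)
    fields = []
    for tag, _a0, _a1, b0, b1 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            fields.extend(tokens[b0:b1])
    return fields


# Hypothetical example pages generated from one template.
pages = [
    "<html><body><h1>Book</h1><p>Title: <b>Dune</b></p>"
    "<p>Price: <i>9.99</i></p></body></html>",
    "<html><body><h1>Book</h1><p>Title: <b>Emma</b></p>"
    "<p>Price: <i>4.50</i></p></body></html>",
]
template = detect_template(pages)
new_page = (
    "<html><body><h1>Book</h1><p>Title: <b>Ulysses</b></p>"
    "<p>Price: <i>12.00</i></p></body></html>"
)
fields = extract_fields(template, new_page)
```

In this sketch a value that happens to be identical on every example page would be absorbed into the template, so the result depends on how varied the example set is.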

Keywords

Internet; database management systems; World Wide Web; automatic extraction dynamic Web data; databases; embedded data extraction; template detection approach; template-generated pages; Application software; Computer science; Data engineering; Data mining; Databases; HTML; Information technology; Skeleton; Web pages; Web sites; automatic; dynamic; extraction; template; web;

fLanguage

English

Publisher

IEEE

Conference_Title

Information Technology and Applications, 2009. IFITA '09. International Forum on

Conference_Location

Chengdu

Print_ISBN

978-0-7695-3600-2

Type

conf

DOI

10.1109/IFITA.2009.211

Filename

5231252