Data extraction from Web data sources

Author

Robinson, Jerome

Author_Institution

Dept. of Comput. Sci., Essex Univ., Colchester, UK

fYear

2004

fDate

30 Aug.-3 Sept. 2004

Firstpage

282

Lastpage

288

Abstract

This paper explains the basic data structures used in a new page analysis technique that creates wrappers (data extractors) for the result pages Web sites produce in response to user queries submitted via Web page forms. The key structure, called a tpGrid, is a representation of the Web page that is easier to analyse than the raw HTML code. The analysis looks for repetition patterns of sets of tagSets, which are defined in the paper.

Keywords

Web sites; data structures; grid computing; hypermedia markup languages; information retrieval; HTML code; Web data source; Web page analysis; data extraction; data extractor; data structure; repetition patterns; tagSets; tpGrid; wrappers; Computer science; Data mining; Data structures; Databases; HTML; Pattern analysis; Production; Springs; Web pages; Web server

fLanguage

English

Publisher

IEEE

Conference_Titel

Database and Expert Systems Applications, 2004. Proceedings. 15th International Workshop on

ISSN

1529-4188

Print_ISBN

0-7695-2195-9

Type

conf

DOI

10.1109/DEXA.2004.1333487

Filename

1333487

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=419374