DocumentCode :
419374
Title :
Data extraction from Web data sources
Author :
Robinson, Jerome
Author_Institution :
Dept. of Comput. Sci., Essex Univ., Colchester, UK
fYear :
2004
fDate :
30 Aug.-3 Sept. 2004
Firstpage :
282
Lastpage :
288
Abstract :
This paper explains the basic data structures used in a new page analysis technique for creating wrappers (data extractors) for the result pages that Web sites produce in response to user queries submitted via Web page forms. The key structure, called a tpGrid, is a representation of the Web page that is easier to analyse than the raw HTML code. The analysis looks for repetition patterns of sets of tagSets, which are defined in the paper.
Keywords :
Web sites; data structures; grid computing; hypermedia markup languages; information retrieval; HTML code; Web data source; Web page analysis; Web sites; data extraction; data extractor; data structure; repetition patterns; tagSets; tpGrid; wrappers; Computer science; Data mining; Data structures; Databases; HTML; Pattern analysis; Production; Springs; Web pages; Web server
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Database and Expert Systems Applications, 2004. Proceedings. 15th International Workshop on
ISSN :
1529-4188
Print_ISBN :
0-7695-2195-9
Type :
conf
DOI :
10.1109/DEXA.2004.1333487
Filename :
1333487