DocumentCode :
3190171
Title :
Deep web data extraction
Author :
Hong, Jer Lang
Author_Institution :
Sch. of IT, Monash Univ., Clayton, VIC, Australia
fYear :
2010
fDate :
10-13 Oct. 2010
Firstpage :
3420
Lastpage :
3427
Abstract :
Current automatic wrappers that use the DOM tree and visual properties of data records to extract information from the deep web generally have limitations, such as the inability to check the similarity of tree structures accurately. Our study shows that data records in the deep web not only share similar visual properties and tree structures but are also semantically related in their contents. We therefore propose an ontological technique that uses an existing lexical database for English (WordNet) to extract data records from deep web pages. Wrappers based on this ontological technique reduce the number of potential data regions identified for extraction, thus improving extraction accuracy. In this study, we use visual cues from the underlying browser rendering engine to locate and extract the relevant data region from the deep web by measuring the text and image sizes of data records. Experimental results show that our technique is robust and performs better than existing state-of-the-art wrappers. Unlike existing ontology-based wrappers, our wrapper is domain independent and can extract a wide range of data records with different structures.
Keywords :
Internet; data analysis; ontologies (artificial intelligence); rendering (computer graphics); tree data structures; DOM tree; automatic wrappers; browser rendering engine; data records; data region; deep Web data extraction; lexical database; ontological based wrappers; tree structures; Ontologies; Solids; Automatic Wrapper; Deep Web; Ontology;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Systems Man and Cybernetics (SMC), 2010 IEEE International Conference on
Conference_Location :
Istanbul
ISSN :
1062-922X
Print_ISBN :
978-1-4244-6586-6
Type :
conf
DOI :
10.1109/ICSMC.2010.5642466
Filename :
5642466