Title :
Deep web data extraction
Author_Institution :
Sch. of IT, Monash Univ., Clayton, VIC, Australia
Abstract :
Current automatic wrappers that use the DOM tree and visual properties of data records to extract information from the deep web generally have limitations, such as the inability to check the similarity of tree structures accurately. Our study shows that data records in the deep web not only share similar visual properties and tree structures but are also semantically related in their contents. We therefore propose an ontological technique that uses an existing lexical database for English (WordNet) to extract data records from deep web pages. Wrappers designed with this ontological technique reduce the number of potential data regions identified for data extraction and thus improve extraction accuracy. In this study, we use visual cues from the underlying browser rendering engine to locate and extract the relevant data region from the deep web by measuring the text and image sizes of data records. Experimental results show that our technique is robust and performs better than existing state-of-the-art wrappers. Unlike existing ontology-based wrappers, our wrapper is domain independent and can extract a wide range of data records with different structures.
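The abstract does not give the details of the size measurement, so the following is only a minimal sketch of the general idea of choosing a data region by the rendered text and image sizes of its records. The names `Record`, `Region`, `region_size`, and `pick_main_region`, the pixel-area units, and the "largest total area wins" rule are all assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    """One candidate data record (hypothetical structure)."""
    text_area: float   # rendered text area in square pixels (assumed unit)
    image_area: float  # rendered image area in square pixels (assumed unit)


@dataclass
class Region:
    """A candidate data region containing several records (hypothetical)."""
    name: str
    records: List[Record]


def region_size(region: Region) -> float:
    """Sum the rendered text and image areas of all records in a region."""
    return sum(r.text_area + r.image_area for r in region.records)


def pick_main_region(regions: List[Region]) -> Region:
    """Return the region with the largest total rendered size.

    Assumption: the relevant data region is the one whose records occupy
    the most rendered area. The paper may use a different criterion.
    """
    if not regions:
        raise ValueError("no candidate regions supplied")
    return max(regions, key=region_size)


# Toy input: a small navigation block and a larger search-result block.
nav = Region("nav", [Record(200.0, 0.0), Record(180.0, 0.0)])
results = Region("results", [Record(1200.0, 4000.0), Record(1100.0, 4000.0)])
main_region = pick_main_region([nav, results])
```

In practice the areas would come from the browser rendering engine rather than being typed in by hand; here they are fixed toy values so the sketch stays self-contained.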
Keywords :
Internet; data analysis; ontologies (artificial intelligence); rendering (computer graphics); tree data structures; DOM tree; automatic wrappers; browser rendering engine; data records; data region; deep Web data extraction; lexical database; ontological based wrappers; tree structures; Ontologies; Solids; Automatic Wrapper; Deep Web; Ontology;
Conference_Title :
Systems Man and Cybernetics (SMC), 2010 IEEE International Conference on
Conference_Location :
Istanbul
Print_ISBN :
978-1-4244-6586-6
DOI :
10.1109/ICSMC.2010.5642466