Regional Information Center for Science and Technology

DocumentCode :

3190171

Title :

Deep web data extraction

Author :

Hong, Jer Lang

Author_Institution :

Sch. of IT, Monash Univ., Clayton, VIC, Australia

fYear :

2010

fDate :

10-13 Oct. 2010

Firstpage :

3420

Lastpage :

3427

Abstract :

Current automatic wrappers that use the DOM tree and the visual properties of data records to extract information from the deep web generally have limitations, such as the inability to check the similarity of tree structures accurately. Our study shows that data records in the deep web not only share similar visual properties and tree structures but are also semantically related in their content. We therefore propose an ontological technique that uses an existing lexical database for English (WordNet) to extract data records from deep web pages. Wrappers based on this ontological technique reduce the number of potential data regions identified for extraction and thus improve extraction accuracy. In this study, we use visual cues from the underlying browser rendering engine to locate and extract the relevant data region from the deep web by measuring the text and image sizes of data records. Experimental results show that our technique is robust and performs better than existing state-of-the-art wrappers. Unlike existing ontology-based wrappers, our wrapper is domain independent and can extract a wide range of data records with different structures.
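
The abstract does not spell out how text and image sizes are turned into a decision, so the following is only an illustrative sketch, not the paper's algorithm. It assumes that each candidate data region comes with the rendered bounding boxes of its text and image nodes, and that the region with the largest total rendered content is the one to keep. The names `Box`, `Region`, `content_size`, and `pick_data_region` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """Rendered bounding box of one text or image node, in pixels (assumed input)."""
    width: float
    height: float

    @property
    def area(self) -> float:
        return self.width * self.height


@dataclass
class Region:
    """A candidate data region with the boxes of the content it contains."""
    name: str
    text_boxes: List[Box]
    image_boxes: List[Box]


def content_size(region: Region) -> float:
    """Total rendered area of the text and image nodes in a region."""
    return sum(box.area for box in region.text_boxes + region.image_boxes)


def pick_data_region(regions: List[Region]) -> Region:
    """Return the candidate with the largest content size.

    Assumption for illustration: the main data region dominates the page
    visually, so the largest text-plus-image area identifies it.
    """
    if not regions:
        raise ValueError("no candidate regions given")
    return max(regions, key=content_size)
```

In this sketch a navigation bar with a few short links would lose to a result list whose records each carry a text block and a thumbnail. A real wrapper would obtain the boxes from the browser rendering engine and would likely combine this cue with the structural and semantic checks the abstract mentions.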

Keywords :

Internet; data analysis; ontologies (artificial intelligence); rendering (computer graphics); tree data structures; DOM tree; automatic wrappers; browser rendering engine; data records; data region; deep Web data extraction; lexical database; ontological based wrappers; tree structures; Ontologies; Solids; Automatic Wrapper; Deep Web; Ontology;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Systems Man and Cybernetics (SMC), 2010 IEEE International Conference on

Conference_Location :

Istanbul

ISSN :

1062-922X

Print_ISBN :

978-1-4244-6586-6

Type :

conf

DOI :

10.1109/ICSMC.2010.5642466

Filename :

5642466

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3190171