Title :
OntoWrap: extracting data records from search engine results pages using ontological technique
Author :
Hong, Jer Lang ; Siew, Eu-Gene ; Egerton, Simon
Author_Institution :
Sch. of Inf. Technol., Monash Univ., Melbourne, VIC, Australia
Abstract :
Current automatic wrappers that use the DOM tree and visual properties of data records to extract the required information from search engine results pages generally have limitations, such as the inability to check the similarity of tree structures accurately. Our study of the properties of data records shows that the data records in search engine results pages not only have similar visual properties and tree structures but are also semantically related in their contents. In this context, we propose an ontological technique that uses an existing lexical database for English (WordNet) to extract data records. We find that wrappers based on this ontological technique reduce the number of potential data regions to be extracted and thus improve data extraction accuracy. We then use visual cues from the browser rendering engine to locate and extract the relevant data region from the web page by measuring the size of the text and images of data records. Experimental results indicate that our technique is robust and performs better than existing state-of-the-art visual-based wrappers.
Keywords :
ontologies (artificial intelligence); search engines; text analysis; DOM tree; Web page; WordNet; browser rendering engine; data records extraction; lexical database; ontological technique; search engine results pages; Data mining; HTML; Information technology; Metasearch; Ontologies; Rendering (computer graphics); Search engines; Tree data structures; Visual databases; Web pages; Automatic Wrapper; Ontology domain; Search engine results pages
Conference_Title :
Information Retrieval & Knowledge Management, (CAMP), 2010 International Conference on
Conference_Location :
Shah Alam, Selangor
Print_ISBN :
978-1-4244-5650-5
DOI :
10.1109/INFRKM.2010.5466936