Regional Information Center for Science and Technology

DocumentCode :

2029675

Title :

Extracting Multi-Records from Web Pages

Author :

Xia, Tian

Author_Institution :

Key Lab. of Data Eng. & Knowledge Eng., Renmin Univ. of China, Beijing, China

fYear :

2008

fDate :

3-5 Dec. 2008

Firstpage :

396

Lastpage :

399

Abstract :

Extracting multi-records from web pages is useful because it allows us to integrate information from multiple sources and provide value-added services. Existing techniques remain limited by the restrictions they impose and by their accuracy. This paper proposes a new method that performs the multi-records extraction task automatically. First, the HTML tag tree is built through an embedded browser interface to solve the AJAX problem. Second, data regions are found by data chunk comparison, and a simple tree matching method is proposed to compute the chunk similarity. Finally, the main data region is determined and the multi-records are extracted. Experimental results show that our method dramatically outperforms other existing methods and can extract multi-records from pages very accurately.
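
The abstract does not spell out how chunk similarity is computed, so the following is only an illustrative sketch. It assumes the classic Simple Tree Matching dynamic program over ordered tag trees, and it assumes the match count is normalized by the combined size of the two chunks; the paper's own variant may differ. The `Node` class and the function names are hypothetical and stand in for whatever tag-tree structure the embedded browser would provide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for one node of an HTML tag tree."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def size(node: Node) -> int:
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(size(child) for child in node.children)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Classic Simple Tree Matching: size of the largest order-preserving
    match between two trees. Roots with different tags match nothing."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # best[i][j] = best match using the first i children of a
    # and the first j children of b.
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            paired = best[i - 1][j - 1] + simple_tree_matching(
                a.children[i - 1], b.children[j - 1]
            )
            best[i][j] = max(best[i - 1][j], best[i][j - 1], paired)
    return best[m][n] + 1  # +1 counts the matched roots


def chunk_similarity(a: Node, b: Node) -> float:
    """Assumed normalization: 2 * matched nodes / total nodes, in [0, 1]."""
    return 2 * simple_tree_matching(a, b) / (size(a) + size(b))


if __name__ == "__main__":
    record_1 = Node("li", [Node("a"), Node("span")])
    record_2 = Node("li", [Node("a"), Node("span"), Node("img")])
    print(simple_tree_matching(record_1, record_2))
    print(chunk_similarity(record_1, record_2))
```

Under these assumptions, two sibling chunks whose similarity exceeds some threshold would be treated as records of the same data region; the threshold itself is a tuning choice that the abstract does not state.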

Keywords :

Web sites; data structures; hypermedia markup languages; online front-ends; user interfaces; HTML tag tree; Web pages; chunk similarity; data regions; embedded browser interface; multirecords extraction; tree matching method; value-added services; Cascading style sheets; Conference management; Data engineering; Data mining; HTML; Information resources; Java; Knowledge engineering; Laboratories; Multi-records extraction; Web2.0; tree similarity; wrapper generation

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Semantics, Knowledge and Grid, 2008. SKG '08. Fourth International Conference on

Conference_Location :

Beijing

Print_ISBN :

978-0-7695-3401-5

Electronic_ISBN :

978-0-7695-3401-5

Type :

conf

DOI :

10.1109/SKG.2008.47

Filename :

4725947

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2029675