Regional Information Center for Science and Technology (RICeST) - Web Data Extraction Based on Tree Structure Analysis and Template Generation

DocumentCode :

3490253

Title :

Web Data Extraction Based on Tree Structure Analysis and Template Generation

Author :

Hong, Haikun ; Chen, Xiaoxin ; Wu, Guoshi ; Li, Jing

Author_Institution :

Sch. of Software Eng., Beijing Univ. of Posts & Telecommun., Beijing, China

fYear :

2010

fDate :

7-9 Nov. 2010

Firstpage :

Lastpage :

Abstract :

This paper studies the problem of extracting data from large numbers of semi-structured web pages. Many websites generate enormous numbers of pages dynamically from an underlying structured source such as a database, which makes it feasible to induce a common template for similar web pages and then extract data accordingly. Previous work on this problem has limited practical utility because it either requires significant human effort or rests on several brittle assumptions. We propose a three-step approach, comprising template generation, template detection, and data extraction, with a little human intervention in template editing. The core algorithm is based on two highly efficient tree structure analysis techniques. Experimental results show that our approach can extract web data with high accuracy and flexibility.

Keywords :

Internet; Web sites; data handling; information retrieval; tree data structures; Web data extraction; Website; human intervention; semistructured web page; template detection; template generation; tree structure analysis; Clustering algorithms; Data mining; HTML; Humans; Web pages; XML

fLanguage :

English

Publisher :

IEEE

Conference_Title :

E-Product E-Service and E-Entertainment (ICEEE), 2010 International Conference on

Conference_Location :

Henan

Print_ISBN :

978-1-4244-7159-1

Type :

conf

DOI :

10.1109/ICEEE.2010.5661561

Filename :

5661561

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3490253