DocumentCode :
3490253
Title :
Web Data Extraction Based on Tree Structure Analysis and Template Generation
Author :
Hong, Haikun ; Chen, Xiaoxin ; Wu, Guoshi ; Li, Jing
Author_Institution :
Sch. of Software Eng., Beijing Univ. of Posts & Telecommun., Beijing, China
fYear :
2010
fDate :
7-9 Nov. 2010
Firstpage :
1
Lastpage :
5
Abstract :
This paper studies the problem of extracting data from large numbers of semi-structured web pages. Many websites generate enormous numbers of pages dynamically from an underlying structured source such as a database, which makes it feasible to induce a common template for similar web pages and then extract data accordingly. Previous work on this problem has limited practical utility because it either requires significant human effort or relies on several brittle assumptions. We propose a three-step approach, consisting of template generation, template detection, and data extraction, with a little human intervention in template editing. The core algorithm is based on two highly efficient tree structure analysis techniques. Experimental results show that our approach can extract web data with high accuracy and flexibility.
Keywords :
Internet; Web sites; data handling; information retrieval; tree data structures; Web data extraction; Website; human intervention; semistructured web page; template detection; template generation; tree structure analysis; Clustering algorithms; Data mining; HTML; Humans; Web pages; XML;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
E-Product E-Service and E-Entertainment (ICEEE), 2010 International Conference on
Conference_Location :
Henan
Print_ISBN :
978-1-4244-7159-1
Type :
conf
DOI :
10.1109/ICEEE.2010.5661561
Filename :
5661561