Title :
Web Data Extraction Based on Tree Structure Analysis and Template Generation
Author :
Hong, Haikun ; Chen, Xiaoxin ; Wu, Guoshi ; Li, Jing
Author_Institution :
Sch. of Software Eng., Beijing Univ. of Posts & Telecommun., Beijing, China
Abstract :
This paper studies the problem of extracting data from large numbers of semi-structured web pages. Many websites generate enormous numbers of pages dynamically from an underlying structured source such as a database, which makes it feasible to induce a common template for similar web pages and then extract data accordingly. Previous work on this problem has limited practical utility because it either requires significant human effort or relies on several brittle assumptions. We propose a three-step approach consisting of template generation, template detection, and data extraction, with only a little human intervention during template editing. The core algorithm is based on two highly efficient tree structure analysis techniques. Experimental results show that our approach can extract web data with high accuracy and flexibility.
Keywords :
Internet; Web sites; data handling; information retrieval; tree data structures; Web data extraction; Website; human intervention; semistructured web page; template detection; template generation; tree structure analysis; Clustering algorithms; Data mining; HTML; Humans; Web pages; XML
Conference_Title :
E-Product E-Service and E-Entertainment (ICEEE), 2010 International Conference on
Conference_Location :
Henan
Print_ISBN :
978-1-4244-7159-1
DOI :
10.1109/ICEEE.2010.5661561