Title :
Improve the Performance of the Webpage Content Extraction Using Webpage Segmentation Algorithm
Author :
Lei, Fu ; Yao, Meng ; Hao, Yu
Author_Institution :
Fujitsu R&D Center Co., Ltd., Beijing, China
Abstract :
In this paper, we present a method that uses a Webpage segmentation algorithm to improve the performance of Webpage content extraction. Traditional methods often parse the DOM tree of the Webpage and judge each node individually to determine whether it is a text node. This approach has a potential problem: because the judgement is local, it sometimes discards part of the content. Our method, which is based on the VIPS (vision-based page segmentation) algorithm, addresses this problem: it extracts the content according to the coordinate information of each block and helps the traditional method recall the lost part of the content.
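The recall idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how blocks are selected or how nodes are matched to them, so the names `Box`, `TextNode`, `recall_lost_content`, the "block holding the most accepted text" rule, and the sample coordinates are all assumptions made for illustration. The blocks stand in for the output of a VIPS-style segmenter, and `accepted` stands in for the verdict of a per-node (local) classifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page coordinates (hypothetical layout data)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, other: "Box") -> bool:
        return (self.left <= other.left and self.top <= other.top
                and self.right >= other.right and self.bottom >= other.bottom)


@dataclass
class TextNode:
    text: str
    box: Box
    accepted: bool  # verdict of the assumed per-node (local) classifier


def recall_lost_content(nodes, blocks):
    """Keep accepted nodes, plus rejected nodes lying inside the main block.

    Assumption: the "main" block is the one containing the most characters
    of locally accepted text. The paper may use a different criterion.
    """
    def accepted_chars(block):
        return sum(len(n.text) for n in nodes
                   if n.accepted and block.contains(n.box))

    if not blocks:
        return [n.text for n in nodes if n.accepted]
    main_block = max(blocks, key=accepted_chars)
    if accepted_chars(main_block) == 0:
        # No block holds accepted text, so there is nothing to recall against.
        return [n.text for n in nodes if n.accepted]
    return [n.text for n in nodes
            if n.accepted or main_block.contains(n.box)]


# Hypothetical page: header, main column, sidebar.
blocks = [Box(0, 0, 800, 100), Box(0, 100, 600, 900), Box(600, 100, 800, 900)]
nodes = [
    TextNode("Site navigation", Box(10, 10, 300, 40), False),
    TextNode("First paragraph of the article body.", Box(20, 120, 580, 200), True),
    TextNode("Short line.", Box(20, 210, 200, 230), False),  # lost by local judgement
    TextNode("Second paragraph of the article body.", Box(20, 240, 580, 400), True),
    TextNode("Ad link", Box(620, 120, 780, 160), False),
]
result = recall_lost_content(nodes, blocks)
```

In this sketch, "Short line." is rejected by the local judgement but recovered because its coordinates fall inside the same block as the accepted paragraphs, while the navigation and advertisement text stay excluded.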
Keywords :
Web sites; information retrieval; trees (mathematics); DOM tree analysis; Web page content extraction; Web page segmentation algorithm; vision-based page segmentation algorithm; Application software; Computer applications; Data mining; Explosions; HTML; Keyword search; Particle separators; Research and development; Web pages; VIPS; Webpage Content Extraction; Webpage Segmentation
Conference_Title :
Computer Science-Technology and Applications, 2009. IFCSTA '09. International Forum on
Conference_Location :
Chongqing
Print_ISBN :
978-0-7695-3930-0
Electronic_ISBN :
978-1-4244-5423-5
DOI :
10.1109/IFCSTA.2009.84