Title :
Automatic Web News Content Extraction Based on Similar Pages
Author :
Zhang, Chunyuan ; Lin, Zhiyang
Author_Institution :
Dept. of Comput. Sci., Hainan Univ., Haikou, China
Abstract :
Most news pages today are generated from an underlying structured source, so template-dependent wrappers should suit them better than template-independent wrappers. In this paper, we propose a novel automatic template-dependent Web news content extraction approach based on similar pages. First, we choose two similar pages as training samples and represent them as two HTML DOM trees. Second, we create the maximum matching tree between the DOM trees using our simple tree matching and backtracking algorithm. Then, by analyzing the characteristics of the nodes in the maximum matching tree, we eliminate the noise nodes to generate an extraction template. Finally, we build a template-dependent wrapper for target news pages whose structures are similar to those of the samples. Experimental results indicate that our approach is effective and efficient for Web news content extraction: the average harmonic mean of precision and recall (the F1 score) reaches 98.3%.
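The tree-matching step in the abstract can be illustrated with a minimal sketch of the classic Simple Tree Matching recurrence, which counts the node pairs in a maximum matching between two ordered, labeled trees. The abstract does not give the details of the authors' own matching and backtracking algorithm, so this is only an assumed baseline: the `Node` class and the `simple_tree_matching` function are hypothetical names introduced here, and the sketch returns only the matching size rather than the matching tree itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for an HTML DOM element: a tag plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Return the number of matched node pairs in a maximum matching of a and b."""
    # Roots with different tags cannot be matched, so neither can their subtrees.
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # dp[i][j] = best matching size between the first i subtrees of a
    # and the first j subtrees of b, preserving sibling order.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] + w)
    # Add 1 for the matched pair of roots.
    return dp[m][n] + 1
```

Recovering the matched node pairs (rather than just their count) would require tracing back through the `dp` tables, which is presumably the role of the backtracking step mentioned in the abstract.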
Keywords :
information resources; tree searching; HTML DOM trees; automatic template-dependent Web news content extraction; backtracking algorithm; extraction template; maximum matching tree; template-dependent wrapper; template-independent wrapper; tree matching; Web news content extraction; similar pages; simple tree matching and backtracking algorithm
Conference_Title :
Web Information Systems and Mining (WISM), 2010 International Conference on
Conference_Location :
Sanya
Print_ISBN :
978-1-4244-8438-6
DOI :
10.1109/WISM.2010.154