Automatic Web News Content Extraction Based on Similar Pages

Author

Zhang, Chunyuan ; Lin, Zhiyang

Author_Institution

Dept. of Comput. Sci., Hainan Univ., Haikou, China

Volume

1

fYear

2010

fDate

23-24 Oct. 2010

Firstpage

232

Lastpage

236

Abstract

Most news pages today are generated from an underlying structured source, so we argue that template-dependent wrappers suit them better than template-independent wrappers. In this paper, we propose a novel automatic template-dependent Web news content extraction approach based on similar pages. First, we choose two similar pages as training samples and represent them as two HTML DOM trees. Second, we create the maximum matching tree between the DOM trees using our simple tree matching and backtracking algorithm. Third, by analyzing the characteristics of nodes in the maximum matching tree, we eliminate the noise nodes to generate an extraction template. Finally, we build a template-dependent wrapper for target news pages whose structures are similar to the samples. Experimental results indicate that our approach is effective and efficient for Web news content extraction: the average harmonic mean of precision and recall (F-measure) reaches 98.3%.
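
The tree matching step named in the abstract can be illustrated with the classic simple tree matching recurrence. This is only a minimal sketch under stated assumptions: the record does not give the authors' exact algorithm, the `Node` class and the `simple_tree_matching` function are hypothetical names introduced here, nodes are compared by tag name alone, and the backtracking pass that recovers the matching tree itself is omitted.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a DOM element: a tag name plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Return the number of node pairs in a maximum matching of trees a and b.

    Assumption: two nodes match only if their tags are equal, and the order
    of siblings must be preserved (the classic simple tree matching rules).
    """
    if a.tag != b.tag:
        return 0  # different roots: nothing below them can be matched
    m, n = len(a.children), len(b.children)
    # dp[i][j] = best score using the first i children of a and first j of b
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = max(
                dp[i - 1][j],  # skip the i-th child of a
                dp[i][j - 1],  # skip the j-th child of b
                dp[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]),
            )
    return dp[m][n] + 1  # +1 for the matched roots


# Two small "similar pages" that share a template but differ in content nodes.
page1 = Node("html", [Node("body", [
    Node("div", [Node("h1"), Node("p")]),
    Node("div", [Node("ul")]),
])])
page2 = Node("html", [Node("body", [
    Node("div", [Node("h1"), Node("p"), Node("p")]),
    Node("div", [Node("span")]),
])])

score = simple_tree_matching(page1, page2)
```

In this sketch the score counts the nodes shared by both pages (html, body, two div elements, h1, and one p). Recording which branch of the `max` was taken at each cell would let a backtracking pass rebuild the matched nodes, which is the kind of structure the abstract's extraction template is derived from.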

Keywords

information resources; tree searching; HTML DOM trees; automatic template-dependent Web news content extraction; backtracking algorithm; extraction template; maximum matching tree; template-dependent wrapper; template-independent wrapper; tree matching; Web news content extraction; similar pages; simple tree matching and backtracking algorithm

fLanguage

English

Publisher

IEEE

Conference_Titel

Web Information Systems and Mining (WISM), 2010 International Conference on

Conference_Location

Sanya

Print_ISBN

978-1-4244-8438-6

Type

conf

DOI

10.1109/WISM.2010.154

Filename

5662317