Title :
Web Content Information Extraction Approach Based on Removing Noise and Content-Features
Author :
Yang, Dingkui ; Song, Jihua
Author_Institution :
Dept. of Inf. Sci. & Technol., Beijing Normal Univ., Beijing, China
Abstract :
This paper presents an improved approach to extracting the main content from web pages. Many financial news pages contain so many links that algorithms based mainly on link density perform poorly when extracting the main content. To solve this problem, we propose a main-content extraction method that first removes common noise and candidate nodes that carry no main-content information from web pages, and then uses the relationship among content text length, anchor text length, and the number of punctuation marks to extract the main content. The paper focuses on noise removal and on the use of several content features. Experiments show that this approach can improve the generality and accuracy of body-text extraction from web pages.
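The abstract names the ingredients of the method (noise removal, content text length, anchor text length, punctuation count) but not the formula that combines them. The Python sketch below is one possible illustration of the idea, not the authors' algorithm: the tag lists in NOISE_TAGS and BLOCK_TAGS, the punctuation set, and the scoring rule in score() are all assumptions made for this example.

```python
from html.parser import HTMLParser

# Assumed lists; the paper's actual noise and candidate rules are not given here.
NOISE_TAGS = {"script", "style", "nav", "footer"}
BLOCK_TAGS = {"div", "td", "article", "section"}
PUNCTUATION = set(",.;:!?，。；：！？、")


class BlockCollector(HTMLParser):
    """Collects text per candidate block, tracking how much of it is anchor text."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished candidate blocks
        self.stack = []        # currently open candidate blocks
        self.noise_depth = 0   # > 0 while inside a noise element
        self.anchor_depth = 0  # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1
        elif tag in BLOCK_TAGS:
            self.stack.append({"text": [], "anchor_len": 0})
        elif tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth:
            self.noise_depth -= 1
        elif tag in BLOCK_TAGS and self.stack:
            self.blocks.append(self.stack.pop())
        elif tag == "a" and self.anchor_depth:
            self.anchor_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self.noise_depth or not self.stack:
            return
        block = self.stack[-1]  # credit text to the innermost open block only
        block["text"].append(text)
        if self.anchor_depth:
            block["anchor_len"] += len(text)


def score(block):
    """Hypothetical score: non-anchor text length weighted by punctuation count."""
    text = " ".join(block["text"])
    if not text:
        return 0.0
    punct = sum(ch in PUNCTUATION for ch in text)
    non_anchor_len = len(text) - block["anchor_len"]
    return float(non_anchor_len * (1 + punct))


def extract_main_content(html):
    """Return the text of the highest-scoring candidate block, or '' if none."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    if not parser.blocks:
        return ""
    best = max(parser.blocks, key=score)
    return " ".join(best["text"])
```

Under this assumed scoring, a block made up almost entirely of links has little non-anchor text and few punctuation marks, so it ranks well below a block of running prose, which is the situation the abstract describes for link-heavy financial news pages.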
Keywords :
Internet; information retrieval; Web content information extraction; Web pages; anchor text length; content text length; extracting main content method; financial news pages; link density; noise content removal; punctuation mark; information extraction; removing noise content; web page content extraction
Conference_Title :
Web Information Systems and Mining (WISM), 2010 International Conference on
Conference_Location :
Sanya
Print_ISBN :
978-1-4244-8438-6
DOI :
10.1109/WISM.2010.82