Regional Information Center for Science and Technology - Web Content Information Extraction Approach Based on Removing Noise and Content-Features

DocumentCode :

3499668

Title :

Web Content Information Extraction Approach Based on Removing Noise and Content-Features

Author :

Yang, Dingkui ; Song, Jihua

Author_Institution :

Dept. of Inf. Sci. & Technol., Beijing Normal Univ., Beijing, China

Volume :

fYear :

2010

fDate :

23-24 Oct. 2010

Firstpage :

246

Lastpage :

249

Abstract :

This paper presents an improved approach to extracting the main content from web pages. Many financial news pages contain so many links that algorithms based mainly on link density perform poorly when extracting the main content. To solve this problem, we propose a main-content extraction method that first removes common noise and the candidate nodes that carry no main-content information, and then uses the relationship among content text length, anchor text length, and the number of punctuation marks to extract the main content. The paper focuses on noise removal and on exploiting a range of content characteristics. Experiments show that the approach improves the generality and accuracy of body-text extraction from web pages.
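
The two stages named in the abstract (noise removal, then scoring candidates by text length, anchor text length, and punctuation count) can be illustrated with a minimal Python sketch. The abstract does not give the actual formula or the list of noise tags, so everything specific here is an assumption for illustration only: the `NOISE_TAGS` set, the choice of `<div>` elements as candidate nodes, and the `score` function, which multiplies non-link text length by one plus the punctuation count. It should not be read as the authors' method.

```python
from dataclasses import dataclass
from html.parser import HTMLParser

# Assumed noise tags and punctuation set; the abstract does not list them.
NOISE_TAGS = {"script", "style", "nav", "footer", "form"}
PUNCTUATION = set(",.;:!?") | set("，。；：！？、")


@dataclass
class Candidate:
    text: str = ""         # all visible text in the block, anchors included
    anchor_text: str = ""  # the part of that text that sits inside <a> elements


class BlockCollector(HTMLParser):
    """Collect text per <div>, skipping anything inside a noise tag."""

    def __init__(self):
        super().__init__()
        self.candidates = []
        self._open = []          # stack of currently open <div> candidates
        self._noise_depth = 0    # > 0 while inside a noise element
        self._anchor_depth = 0   # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._noise_depth += 1
        elif tag == "div":
            self._open.append(Candidate())
        elif tag == "a":
            self._anchor_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS:
            self._noise_depth = max(0, self._noise_depth - 1)
        elif tag == "div" and self._open:
            self.candidates.append(self._open.pop())
        elif tag == "a":
            self._anchor_depth = max(0, self._anchor_depth - 1)

    def handle_data(self, data):
        if self._noise_depth or not self._open:
            return
        block = self._open[-1]   # attribute text to the innermost open <div>
        block.text += data
        if self._anchor_depth:
            block.anchor_text += data


def visible_length(s):
    """Count non-whitespace characters."""
    return sum(not ch.isspace() for ch in s)


def score(c):
    """Hypothetical score: non-link text length weighted by punctuation count."""
    non_link = max(0, visible_length(c.text) - visible_length(c.anchor_text))
    punct = sum(ch in PUNCTUATION for ch in c.text)
    return non_link * (1 + punct)


def extract_main_content(html):
    """Return the text of the best-scoring candidate, or '' if none qualifies."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    scored = [(score(c), c) for c in parser.candidates]
    scored = [(s, c) for s, c in scored if s > 0]  # drop link-only or empty blocks
    if not scored:
        return ""
    return max(scored, key=lambda pair: pair[0])[1].text.strip()
```

In this sketch a block made up entirely of links scores zero and is discarded before ranking, which mirrors the abstract's step of removing candidate nodes that carry no main-content information. A link-heavy financial news page can still yield its article body, because the score rewards long, punctuated, non-link text rather than penalizing link density directly.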

Keywords :

Internet; information retrieval; Web content information extraction; Web pages; anchor text length; content text length; extracting main content method; financial news pages; link density; noise content removal; punctuation mark; information extraction; removing noise content; web page content extraction;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Web Information Systems and Mining (WISM), 2010 International Conference on

Conference_Location :

Sanya

Print_ISBN :

978-1-4244-8438-6

Type :

conf

DOI :

10.1109/WISM.2010.82

Filename :

5662320

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3499668