Regional Information Center for Science and Technology (RICEST) - Efficient Web Page Main Text Extraction towards Online News Analysis

DocumentCode :

2520398

Title :

Efficient Web Page Main Text Extraction towards Online News Analysis

Author :

Zhou, Baoyao ; Xiong, Yuhong ; Liu, Wei

Author_Institution :

Hewlett-Packard Labs. China, Beijing, China

fYear :

2009

fDate :

21-23 Oct. 2009

Firstpage :

Lastpage :

Abstract :

We propose a simple approach for quickly extracting the main text content from Web pages, especially online news pages. Most existing approaches first construct the DOM tree from the HTML source of the Web page and then extract the important content by pruning or merging DOM branches and sub-trees. Such DOM tree processing is very time-consuming. Our solution instead processes the HTML source directly as a paragraphed text string and extracts the main text content solely by analyzing the word counts of the text paragraphs. Compared with existing DOM-based approaches, the proposed approach is simple and fast without sacrificing accuracy. It can be applied in practical applications with strict efficiency requirements, such as online news analysis. The experimental results show that our solution can efficiently and effectively extract the news content from online news pages for further analysis.
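
The idea in the abstract (treat the HTML source as a paragraphed text string and select paragraphs by word count, with no DOM tree) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the list of paragraph-breaking tags, the function names `split_paragraphs` and `extract_main_text`, and the `min_words` threshold are all assumptions made for illustration, since the abstract does not specify them.

```python
import re
from html import unescape

# Assumed set of tags that end a paragraph; the abstract does not list them.
BLOCK_TAGS = r"p|div|br|li|tr|td|h[1-6]|table|ul|ol"


def split_paragraphs(html_source):
    """Turn raw HTML into plain-text paragraphs without building a DOM tree."""
    # Drop script/style bodies and comments, which never hold article text.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html_source)
    text = re.sub(r"(?s)<!--.*?-->", " ", text)
    # Treat block-level tags as paragraph boundaries.
    text = re.sub(rf"(?i)</?(?:{BLOCK_TAGS})\b[^>]*>", "\n", text)
    # Remove all remaining (inline) tags, keeping their text.
    text = re.sub(r"<[^>]+>", "", text)
    text = unescape(text)
    # Normalize whitespace inside each paragraph and drop empty ones.
    paragraphs = [" ".join(line.split()) for line in text.split("\n")]
    return [p for p in paragraphs if p]


def extract_main_text(html_source, min_words=10):
    """Keep only paragraphs with at least min_words words (hypothetical rule)."""
    return [p for p in split_paragraphs(html_source)
            if len(p.split()) >= min_words]
```

Short paragraphs such as navigation links and copyright notices fall below the threshold, while article paragraphs pass it. A fixed `min_words` cutoff is the simplest possible selection rule; the paper's actual word-count analysis may well differ.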

Keywords :

Web sites; hypermedia markup languages; information retrieval; text analysis; tree data structures; DOM tree structure; HTML source; Web page; main text content extraction; online news analysis; paragraphed text string; pruning; Content based retrieval; Data mining; HTML; Image segmentation; Information analysis; Information retrieval; Navigation; Tree data structures; Web pages; Web sites; Web content analysis; Web information extraction;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

e-Business Engineering, 2009. ICEBE '09. IEEE International Conference on

Conference_Location :

Macau

Print_ISBN :

978-0-7695-3842-6

Type :

conf

DOI :

10.1109/ICEBE.2009.15

Filename :

5342131

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2520398