Regional Information Center for Science and Technology - Extracting Content from Web Pages Based on RSS

DocumentCode :

1970676

Title :

Extracting Content from Web Pages Based on RSS

Author :

Qingcheng, Li ; Youmeng, Li

Author_Institution :

Nankai Univ., Tianjin

Volume :

fYear :

2008

fDate :

12-14 Dec. 2008

Firstpage :

218

Lastpage :

221

Abstract :

This paper proposes a new method for extracting content from Web pages based on an RSS index. The method discovers the collection of structurally similar Web pages listed in an RSS feed and finds the page template they share algorithmically. By computing features of the content blocks, it obtains the body template, which is then used for batch extraction from the Web pages in the collection. The method is strongly fault-tolerant with respect to Web documents, and the results show that it has high accuracy and is widely adaptable.
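
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes an RSS 2.0 feed and uses a simple stand-in heuristic: a text block that appears under the same tag path on every page in the collection is treated as template, and whatever remains is treated as body content. The names `feed_links`, `BlockParser`, `page_blocks`, and `extract_bodies` are hypothetical and chosen for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from html.parser import HTMLParser


def feed_links(rss_xml):
    """Return the <link> URL of every <item> in an RSS 2.0 feed (assumed format)."""
    root = ET.fromstring(rss_xml)
    return [item.findtext("link", "").strip() for item in root.iter("item")]


class BlockParser(HTMLParser):
    """Collect (tag path, text) pairs for every non-empty text node."""

    VOID = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.blocks = []  # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Tolerate sloppy markup: pop back to the nearest matching open tag.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.blocks.append(("/".join(self.stack), text))


def page_blocks(html):
    """Parse one HTML page into its list of (tag path, text) blocks."""
    parser = BlockParser()
    parser.feed(html)
    parser.close()
    return parser.blocks


def extract_bodies(pages):
    """Drop blocks repeated on every page (template); keep the rest as body text."""
    parsed = [page_blocks(p) for p in pages]
    seen = Counter(block for blocks in parsed for block in set(blocks))
    template = {block for block, n in seen.items() if n == len(pages)}
    return [
        " ".join(text for path, text in blocks if (path, text) not in template)
        for blocks in parsed
    ]
```

In use, the pages behind `feed_links(...)` would be downloaded and passed to `extract_bodies` as a list of HTML strings. The heuristic needs at least two pages from the same feed, since with a single page every block counts as template.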

Keywords :

Web sites; document handling; information retrieval; RSS; Web documents; Web pages; content extraction; Data mining; Fault tolerance; Feeds; HTML; Information filtering; Information filters; Information processing; Internet; Navigation; Web template; web block

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Computer Science and Software Engineering, 2008 International Conference on

Conference_Location :

Wuhan, Hubei

Print_ISBN :

978-0-7695-3336-0

Type :

conf

DOI :

10.1109/CSSE.2008.85

Filename :

4722882

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1970676