DocumentCode :
1970676
Title :
Extracting Content from Web Pages Based on RSS
Author :
Qingcheng, Li ; Youmeng, Li
Author_Institution :
Nankai Univ., Tianjin
Volume :
5
fYear :
2008
fDate :
12-14 Dec. 2008
Firstpage :
218
Lastpage :
221
Abstract :
This paper proposes a new method for extracting content from Web pages based on an RSS index. The method discovers collections of structurally similar Web page documents in an RSS feed and finds the page template they share. By computing the features of content blocks, it obtains the body template, and finally performs batch extraction from the Web pages in the collection. The method has strong fault tolerance for Web documents, and the results show that it is highly accurate and widely adaptable.
Keywords :
Web sites; document handling; information retrieval; RSS; Web documents; Web pages; content extraction; Data mining; Fault tolerance; Feeds; HTML; Information filtering; Information filters; Information processing; Internet; Navigation; Web template; web block
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Computer Science and Software Engineering, 2008 International Conference on
Conference_Location :
Wuhan, Hubei
Print_ISBN :
978-0-7695-3336-0
Type :
conf
DOI :
10.1109/CSSE.2008.85
Filename :
4722882