Title :
Extracting Content from Web Pages Based on RSS
Author :
Qingcheng, Li ; Youmeng, Li
Author_Institution :
Nankai Univ., Tianjin
Abstract :
This paper proposes a new method for extracting content from Web pages, using RSS as an index. The method discovers a collection of structurally similar Web documents through an RSS feed and derives the page template for that collection algorithmically. By computing features of the content blocks, it obtains the body template, which is then used to batch-extract content from the Web pages in the collection. The method is highly fault-tolerant with respect to Web documents, and the results show that it is highly accurate and widely adaptable.
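The pipeline outlined in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not specify how templates or block features are computed, so the sketch assumes that a block is identified by its tag path, that text repeated across pages belongs to the page template, and that the body is the varying block with the greatest average text length. All function and class names (`rss_links`, `BlockCollector`, `page_blocks`, `body_path`, `extract_all`) are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from html.parser import HTMLParser


def rss_links(rss_xml):
    """Return the <link> of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    return [item.findtext("link") for item in root.iter("item")]


class BlockCollector(HTMLParser):
    """Map each tag path (e.g. 'html/body/p') to the text found under it."""

    VOID = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.stack = []
        self.blocks = defaultdict(str)

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Tolerate unclosed tags: pop back to the nearest matching open tag.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.blocks["/".join(self.stack)] += text + " "


def page_blocks(html):
    """Parse one page into {tag path: text}."""
    collector = BlockCollector()
    collector.feed(html)
    return {path: text.strip() for path, text in collector.blocks.items()}


def body_path(pages):
    """Assumed heuristic: among tag paths present on every page, skip those
    whose text repeats (template) and pick the one with the longest average text."""
    per_page = [page_blocks(p) for p in pages]
    shared = set(per_page[0]).intersection(*per_page[1:])
    best, best_score = None, -1.0
    for path in sorted(shared):
        texts = [blocks[path] for blocks in per_page]
        if len(set(texts)) < len(texts):  # repeated text: template boilerplate
            continue
        score = sum(len(t) for t in texts) / len(texts)
        if score > best_score:
            best, best_score = path, score
    return best


def extract_all(pages):
    """Batch-extract the body block from every page in the collection."""
    path = body_path(pages)
    return [page_blocks(p).get(path, "") for p in pages]
```

In use, the links returned by `rss_links` would be fetched by whatever HTTP client is at hand, and the resulting HTML strings passed to `extract_all`. Fetching is left out so that the sketch stays self-contained.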
Keywords :
Web sites; document handling; information retrieval; RSS; Web documents; Web pages; content extraction; Data mining; Fault tolerance; Feeds; HTML; Information filtering; Information filters; Information processing; Internet; Navigation; Web template; web block
Conference_Title :
Computer Science and Software Engineering, 2008 International Conference on
Conference_Location :
Wuhan, Hubei
Print_ISBN :
978-0-7695-3336-0
DOI :
10.1109/CSSE.2008.85