Title :
Efficient Web Page Main Text Extraction towards Online News Analysis
Author :
Zhou, Baoyao ; Xiong, Yuhong ; Liu, Wei
Author_Institution :
Hewlett-Packard Labs China, Beijing, China
Abstract :
We propose a simple approach for quickly extracting the main text content from Web pages, especially online news pages. Most existing approaches first construct the DOM tree from the HTML source of the Web page and then extract the important content by pruning or merging DOM branches and subtrees. Such DOM tree processing is very time-consuming. Our solution instead processes the HTML source directly as a paragraphed text string and extracts the main text content by analyzing only the word count of each text paragraph. Compared with existing DOM-based approaches, the proposed approach is simple and fast without losing accuracy. It can be applied in practical applications with strict efficiency requirements, such as online news analysis. The experimental results show that our solution can efficiently and effectively extract the news content from online news pages for further analysis.
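The idea of treating the HTML source as a paragraphed text string and filtering paragraphs by word count can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no algorithmic details, so the function names (split_paragraphs, extract_main_text), the set of tags treated as paragraph breaks, and the 20-word threshold are all assumptions made for illustration.

```python
import re
from html import unescape

# Assumed threshold (not from the paper): paragraphs with fewer words
# than this are treated as navigation, captions, or other boilerplate.
MIN_WORDS = 20


def split_paragraphs(html):
    """Turn raw HTML into a list of text paragraphs without building a DOM."""
    # Remove script/style elements and comments, including their contents.
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)
    html = re.sub(r"(?s)<!--.*?-->", " ", html)
    # Treat an assumed set of block-level tags as paragraph breaks.
    html = re.sub(
        r"(?i)</?(p|div|br|li|tr|td|h[1-6]|table|ul|ol)\b[^>]*>", "\n", html
    )
    # Strip every remaining (inline) tag and decode character entities.
    text = unescape(re.sub(r"<[^>]+>", "", html))
    # Normalize whitespace and drop empty lines.
    return [" ".join(line.split()) for line in text.split("\n") if line.strip()]


def extract_main_text(html, min_words=MIN_WORDS):
    """Keep only paragraphs whose word count reaches the threshold."""
    paragraphs = split_paragraphs(html)
    return "\n".join(p for p in paragraphs if len(p.split()) >= min_words)
```

Calling extract_main_text(page_source) on a news page would, under these assumptions, drop short fragments such as menu links and copyright lines while keeping the longer article paragraphs. A fixed threshold is the simplest possible rule and is used here only to show why a single pass over the string can be cheaper than DOM construction.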
Keywords :
Web sites; hypermedia markup languages; information retrieval; text analysis; tree data structures; DOM tree structure; HTML source; Web page; main text content extraction; online news analysis; paragraphed text string; pruning; Content based retrieval; Data mining; HTML; Image segmentation; Information analysis; Navigation; Web pages; Web content analysis; Web information extraction
Conference_Title :
e-Business Engineering, 2009. ICEBE '09. IEEE International Conference on
Conference_Location :
Macau
Print_ISBN :
978-0-7695-3842-6
DOI :
10.1109/ICEBE.2009.15