Title :
News Web Text Extraction Based on the Maximum Subsequence Segmentation
Author :
Jianzhuo Yan ; Hexin Duan ; Liying Fang ; Wang Ying
Author_Institution :
Coll. of Electron. Inf. & Control Eng., Beijing Univ. of Technol., Beijing, China
Abstract :
Many people rely on the web as their main source of information in daily life. However, most web pages contain non-informational components, such as site bars, footers, and ads, which complicate extracting text from the original HTML documents. Although web text extraction techniques have been developed, they require substantial human intervention and yield low extraction quality, so their wide adoption and efficiency remain open problems. In this paper, we propose a maximum subsequence segmentation (MSS) algorithm and discuss its application to news web sites. Unlike tree structure analysis and VIPS, the algorithm divides a web page into text segments and label segments. Experiments show that the MSS algorithm achieves 93.73% accuracy on 2000 news pages from 5 different news sites and runs much faster than a DOM-based method on the same dataset.
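The abstract does not give the scoring or segmentation details of the MSS algorithm, so the following is only a minimal sketch of the general maximum-subsequence idea, not the authors' method. It assumes each HTML tag (label) token gets a negative score and each word of text a positive score, then uses Kadane's algorithm to find the contiguous token run with the highest total. The weights `TEXT_SCORE` and `TAG_SCORE`, the regex tokenizer, and all function names are hypothetical choices made for illustration.

```python
import re

# Hypothetical weights chosen for illustration; they are not taken from the paper.
TEXT_SCORE = 1.0   # score for each word of visible text
TAG_SCORE = -3.0   # score for each HTML tag (label) token

# A token is either a whole tag such as <a href="/"> or a run of non-space text.
TOKEN_RE = re.compile(r"<[^>]*>|[^<\s]+")


def tokenize(html):
    """Split raw HTML into tag tokens and word tokens, in document order."""
    return TOKEN_RE.findall(html)


def max_subsequence(scores):
    """Kadane's algorithm.

    Returns (start, end) of the contiguous run with the largest sum,
    with end exclusive. Returns (0, 0) if no run has a positive sum.
    """
    best_sum, best_start, best_end = 0.0, 0, 0
    cur_sum, cur_start = 0.0, 0
    for i, score in enumerate(scores):
        if cur_sum <= 0:
            # A non-positive prefix cannot help; restart the run here.
            cur_sum, cur_start = score, i
        else:
            cur_sum += score
        if cur_sum > best_sum:
            best_sum, best_start, best_end = cur_sum, cur_start, i + 1
    return best_start, best_end


def extract_text(html):
    """Return the words inside the highest-scoring token run."""
    tokens = tokenize(html)
    scores = [TAG_SCORE if t.startswith("<") else TEXT_SCORE for t in tokens]
    start, end = max_subsequence(scores)
    return " ".join(t for t in tokens[start:end] if not t.startswith("<"))


if __name__ == "__main__":
    page = (
        '<div><a href="/">Home</a><a href="/s">Sports</a></div>'
        "<p>The quick brown fox jumps over the lazy dog near the river bank today.</p>"
        "<div><a>About</a></div>"
    )
    print(extract_text(page))
```

With these assumed weights, tag-dense regions such as navigation bars and footers pull the running sum down, so the selected run tends to settle on the long, tag-sparse article body. A practical system would also need to handle scripts, styles, comments, and malformed markup, which this sketch ignores.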
Keywords :
Web sites; hypermedia markup languages; information retrieval; text analysis; tree data structures; HTML documents; MSS algorithm; VIPS; Web pages; ads; extraction quality; footers; human intervention; information source; label segmentation; maximum subsequence segmentation; news Web sites; news Web text extraction; news pages; noninformation components; site bars; text segmentation; tree structure analysis; Accuracy; Algorithm design and analysis; Data mining; HTML; Navigation; Noise; Web pages; maximum subsequence segmentation; web text extraction;
Conference_Title :
Computational and Information Sciences (ICCIS), 2013 Fifth International Conference on
Conference_Location :
Shiyang
DOI :
10.1109/ICCIS.2013.170