Title :
CELB: Content extraction based on line-block
Author :
Ma, Xiao ; Chen, Jiangfeng ; Zhang, Hui
Author_Institution :
Sch. of Comput. Sci., Beihang Univ., Beijing, China
fDate :
Nov. 29 2011-Dec. 1 2011
Abstract :
In this paper, we propose CELB, a simple, fast, and accurate content extraction method. Unlike traditional methods, CELB does not parse the DOM tree; it uses only information from the lines of the original HTML document. We introduce the concept of a line-block to extract content more effectively, and a new feature, the distance-text number (DTN), to distinguish content from non-content. We first preprocess the original HTML document and then combine its lines into line-blocks. Next, we calculate the values of the content features for each line-block and use thresholds to determine whether the line-block is part of the main content. Experiments show satisfactory results, especially in running time.
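The pipeline outlined in the abstract (preprocess, group lines into line-blocks, score each block, apply thresholds) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not define DTN or give threshold values, so the sketch substitutes two simple stand-in features (visible text length and text-to-markup ratio), and every function name, grouping rule, and constant below is hypothetical.

```python
import re

# Hypothetical thresholds; the abstract does not state any values.
MIN_TEXT_CHARS = 40   # minimum visible characters in a block
MIN_TEXT_RATIO = 0.5  # minimum share of visible text among all characters


def preprocess(html):
    """Drop scripts, styles and comments, then split into stripped lines."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", html)
    html = re.sub(r"(?s)<!--.*?-->", "", html)
    return [line.strip() for line in html.splitlines()]


def strip_tags(line):
    """Return the visible text of one line by removing tags."""
    return re.sub(r"<[^>]*>", "", line).strip()


def line_blocks(lines):
    """Group consecutive text-bearing lines into blocks (assumed rule)."""
    blocks, current = [], []
    for line in lines:
        if strip_tags(line):
            current.append(line)
        elif current:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


def is_content(block):
    """Stand-in features: text length and text-to-markup ratio (not DTN)."""
    raw = sum(len(line) for line in block)
    text = sum(len(strip_tags(line)) for line in block)
    return text >= MIN_TEXT_CHARS and text / raw >= MIN_TEXT_RATIO


def extract_content(html):
    """Keep the visible text of every block that passes the thresholds."""
    kept = [b for b in line_blocks(preprocess(html)) if is_content(b)]
    return "\n".join(strip_tags(line) for b in kept for line in b)
```

Because the sketch works line by line on the raw markup and never builds a DOM tree, its cost grows roughly linearly with document size, which is in the spirit of the running-time emphasis in the abstract.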
Keywords :
hypermedia markup languages; text analysis; CELB; DOM trees; DTN; HTML documents; content extraction method; content features; distance-text number; on line-block; Chaos; HTML; Internet; Noise; Standards; Web pages;
Conference_Title :
Computer Sciences and Convergence Information Technology (ICCIT), 2011 6th International Conference on
Conference_Location :
Seogwipo
Print_ISBN :
978-1-4577-0472-7