CELB: Content extraction based on line-block

Author

Ma, Xiao ; Chen, Jiangfeng ; Zhang, Hui

Author_Institution

Sch. of Comput. Sci., Beihang Univ., Beijing, China

fYear

2011

fDate

Nov. 29 2011-Dec. 1 2011

Firstpage

412

Lastpage

417

Abstract

In this paper, we propose a simple, fast and accurate content extraction method: CELB. Unlike traditional methods, this approach does not parse DOM trees and uses only information from the lines of the original HTML documents. We propose a concept called the line-block to extract content more effectively, and a new feature, distance-text number (DTN), to distinguish content from non-content. First, we preprocess the original HTML documents and then combine lines into line-blocks. Next, we calculate the values of content features for each line-block and use thresholds to determine whether a line-block is part of the main content. Experiments show satisfactory results, especially in running time.
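
The pipeline outlined in the abstract (preprocess, group lines into line-blocks, score each block, apply a threshold) can be illustrated with a minimal Python sketch. The abstract does not define DTN or the exact line-block construction, so everything below is an assumption made for illustration: a line-block is taken to be a run of consecutive lines that still contain text after tags are stripped, the feature is a plain character count standing in for DTN, and the names `preprocess`, `line_blocks`, `extract_content` and the `min_chars` threshold are hypothetical. It is not the authors' implementation.

```python
import re

# Assumption: regex-based cleanup is enough for a sketch; real pages may need more care.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
SCRIPT_RE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def preprocess(html):
    """Remove comments, scripts and styles, then strip tags from each line.

    No DOM tree is built; the result is one text string per source line.
    Tags that span several lines are not handled by this sketch.
    """
    html = COMMENT_RE.sub("", html)
    html = SCRIPT_RE.sub("", html)
    return [TAG_RE.sub("", line).strip() for line in html.splitlines()]


def line_blocks(lines):
    """Group consecutive non-empty text lines into blocks (assumed definition)."""
    blocks, current = [], []
    for text in lines:
        if text:
            current.append(text)
        elif current:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


def extract_content(html, min_chars=100):
    """Keep blocks whose total text length reaches the threshold.

    The character count is a stand-in feature, not the paper's DTN.
    """
    kept = []
    for block in line_blocks(preprocess(html)):
        if sum(len(text) for text in block) >= min_chars:
            kept.extend(block)
    return "\n".join(kept)
```

Under these assumptions, short runs such as navigation links or a copyright line fall below the threshold and are dropped, while longer runs of paragraph text are kept. The threshold value would need tuning on real pages.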

Keywords

hypermedia markup languages; text analysis; CELB; DOM trees; DTN; HTML documents; content extraction method; content features; distance-text number; line-block; Chaos; HTML; Internet; Noise; Standards; Web pages;

fLanguage

English

Publisher

IEEE

Conference_Titel

Computer Sciences and Convergence Information Technology (ICCIT), 2011 6th International Conference on

Conference_Location

Seogwipo

Print_ISBN

978-1-4577-0472-7

Type

conf

Filename

6316649

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=575013