• DocumentCode
    575013
  • Title

    CELB: Content extraction based on line-block

  • Author

    Ma, Xiao ; Chen, Jiangfeng ; Zhang, Hui

  • Author_Institution
    Sch. of Comput. Sci., Beihang Univ., Beijing, China
  • fYear
    2011
  • fDate
    Nov. 29 2011-Dec. 1 2011
  • Firstpage
    412
  • Lastpage
    417
  • Abstract
    In this paper, we propose a simple, fast, and accurate content extraction method: CELB. Unlike traditional methods, this approach does not parse DOM trees and uses only information from the lines of the original HTML documents. We propose a concept called the line-block to extract content more effectively, and a new feature, distance-text number (DTN), to distinguish content from non-content. First, we preprocess the original HTML documents and then combine lines into line-blocks. Next, we calculate the values of content features for each line-block and use thresholds to determine whether a line-block is part of the main content. Experiments show satisfactory results, especially in running time.
  • Keywords
    hypermedia markup languages; text analysis; CELB; DOM trees; DTN; HTML documents; content extraction method; content features; distance-text number; line-block; Chaos; HTML; Internet; Noise; Standards; Web pages
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Computer Sciences and Convergence Information Technology (ICCIT), 2011 6th International Conference on
  • Conference_Location
    Seogwipo
  • Print_ISBN
    978-1-4577-0472-7
  • Type

    conf

  • Filename
    6316649
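
The pipeline outlined in the abstract (preprocess the HTML, combine lines into line-blocks, compute features per block, apply thresholds) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not define DTN, the block-grouping rule, or any threshold values, so the sketch substitutes two stand-in features (visible text length and text-to-markup ratio), groups lines at blank lines, and uses hypothetical names and constants throughout.

```python
import re

# Hypothetical thresholds; the abstract gives no values.
MIN_TEXT_CHARS = 40
MIN_TEXT_RATIO = 0.5

TAG_RE = re.compile(r"<[^>]+>")
# Scripts, styles, and comments are dropped during preprocessing (assumed step).
DROP_RE = re.compile(
    r"<(script|style)\b.*?</\1\s*>|<!--.*?-->", re.DOTALL | re.IGNORECASE
)


def preprocess(html):
    """Remove scripts, styles, and comments, then split into raw lines."""
    return DROP_RE.sub("", html).splitlines()


def to_line_blocks(lines):
    """Group consecutive non-blank lines into blocks (assumed grouping rule)."""
    blocks, current = [], []
    for line in lines:
        if line.strip():
            current.append(line)
        elif current:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


def text_of(block):
    """Visible text of a block: tags stripped, whitespace collapsed."""
    return " ".join(TAG_RE.sub(" ", " ".join(block)).split())


def is_content(block):
    """Threshold test on two stand-in features (not DTN, which is undefined here)."""
    raw_len = sum(len(line.strip()) for line in block)
    text = text_of(block)
    return len(text) >= MIN_TEXT_CHARS and len(text) / raw_len >= MIN_TEXT_RATIO


def extract_content(html):
    """Return the text of every block that passes the threshold test."""
    blocks = to_line_blocks(preprocess(html))
    return "\n".join(text_of(b) for b in blocks if is_content(b))
```

Note that the sketch works only on raw lines and regular expressions and never builds a DOM tree, which mirrors the line-based idea the abstract emphasizes; everything else about it is an assumption.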