• DocumentCode
    2995572
  • Title

    Block-Level Links Based Content Extraction

  • Author

    Shen, Shixing ; Zhang, Hui

  • Author_Institution
    Beihang Univ., Beijing, China
  • fYear
    2011
  • fDate
    9-11 Dec. 2011
  • Firstpage
    330
  • Lastpage
    333
  • Abstract
    We present block-level links based content extraction (BLCE), a method that extracts content from web pages by using the link attributes of blocks, namely the number of links and the length of the link text (anchor text). We describe how to divide a web page into blocks and how to merge similar blocks into one, and then compute the number of links and the total length of anchor text for each block. We find that extracting content using only the number of links and the length of anchor text is not effective, because both quantities are proportional to the length of the page. Link density addresses this problem, so we use the content links ratios and the content anchor text ratios to describe the link attributes of the blocks. BLCE performs better than other methods, especially on newer web pages built with DIV and CSS, where traditional algorithms do not work well.
  • Keywords
    Web sites; content management; CSS; DIV; Web pages; block level links based content extraction; content anchor text ratios; content links ratios; Cascading style sheets; Data mining; HTML; Internet; Navigation; Probability distribution; block-level links; content extraction; merge block
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel Architectures, Algorithms and Programming (PAAP), 2011 Fourth International Symposium on
  • Conference_Location
    Tianjin
  • Print_ISBN
    978-1-4577-1808-3
  • Type

    conf

  • DOI
    10.1109/PAAP.2011.49
  • Filename
    6128527
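
The abstract's idea of describing each block by link ratios rather than raw link counts can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not define the ratios or the block-merging step, so the block tags, the two ratio formulas, the `max_ratio` threshold, and every name here (`BlockLinkStats`, `block_stats`, `links_ratio`, `anchor_text_ratio`, `is_content`) are assumptions made for this example. It also assumes well-formed HTML in which every block tag is explicitly closed.

```python
from html.parser import HTMLParser

# Assumption: these tags delimit "blocks"; the paper's segmentation may differ.
BLOCK_TAGS = {"div", "p", "td", "li", "section", "article"}


class BlockLinkStats(HTMLParser):
    """Collects text length, link count and anchor-text length per block."""

    def __init__(self):
        super().__init__()
        self.blocks = []     # finished blocks that contain some text
        self._stack = []     # blocks that are currently open
        self._in_anchor = 0  # nesting depth of <a> inside the open block

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._stack.append({"text_len": 0, "links": 0, "anchor_len": 0})
        elif tag == "a" and self._stack:
            self._in_anchor += 1
            self._stack[-1]["links"] += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self._stack:
            block = self._stack.pop()
            if block["text_len"] > 0:
                self.blocks.append(block)
        elif tag == "a" and self._in_anchor:
            self._in_anchor -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or not self._stack:
            return
        # Text is attributed to the innermost open block only.
        self._stack[-1]["text_len"] += len(text)
        if self._in_anchor:
            self._stack[-1]["anchor_len"] += len(text)


def block_stats(html):
    """Parse an HTML string and return the per-block statistics."""
    parser = BlockLinkStats()
    parser.feed(html)
    parser.close()
    return parser.blocks


def links_ratio(block):
    # Assumed definition: links per character of block text.
    return block["links"] / block["text_len"]


def anchor_text_ratio(block):
    # Assumed definition: share of the block's text that is anchor text.
    return block["anchor_len"] / block["text_len"]


def is_content(block, max_ratio=0.5):
    # Hypothetical threshold: blocks dominated by anchor text are treated
    # as navigation rather than content.
    return anchor_text_ratio(block) < max_ratio
```

For example, a navigation block such as `<div><a href="/a">Home</a> <a href="/b">News</a></div>` consists entirely of anchor text and is rejected by `is_content`, while a paragraph of prose containing a single short link is kept. Because both measures are normalized by the block's text length, they do not grow with page length the way raw link counts do.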