DocumentCode
2995572
Title
Block-Level Links Based Content Extraction
Author
Shen, Shixing ; Zhang, Hui
Author_Institution
Beihang Univ., Beijing, China
fYear
2011
fDate
9-11 Dec. 2011
Firstpage
330
Lastpage
333
Abstract
We present block-level links based content extraction (BLCE), a method for extracting content from web pages using the link attributes of blocks: the number of links and the length of the link text (anchor text). We describe how to divide a web page into blocks and how to merge similar blocks into one, and then compute the number of links and the total length of the anchor text. We find that extracting content using only the raw number of links and the anchor text length is not effective, because both quantities are proportional to the length of the page. Link density addresses this problem, so we use the content links ratio and the content anchor text ratio to describe the link attributes of each block. BLCE performs better than other methods, especially on newer web pages built with DIV and CSS, where traditional algorithms do not work well.
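The ratio idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact definitions, so everything here is an assumption made for illustration: the `Block` type, the choice of links per word and anchor characters per text character as the two ratios, and the threshold values in `is_content` are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """A page block after splitting and merging (hypothetical structure)."""
    text: str                 # all visible text in the block, anchor text included
    anchor_texts: List[str]   # the text of each link found in the block


def link_ratios(block: Block) -> Tuple[float, float]:
    """Return (links per word, anchor characters per text character).

    Both ratios are assumed definitions; they normalise the raw link
    counts by block length so long blocks are not penalised.
    """
    words = len(block.text.split())
    chars = len(block.text)
    links = len(block.anchor_texts)
    anchor_chars = sum(len(a) for a in block.anchor_texts)
    links_ratio = links / words if words else 0.0
    anchor_ratio = anchor_chars / chars if chars else 0.0
    return links_ratio, anchor_ratio


def is_content(block: Block,
               max_links_ratio: float = 0.2,
               max_anchor_ratio: float = 0.3) -> bool:
    """Treat a block as content when both ratios are low.

    The thresholds are illustrative placeholders, not values from the paper.
    """
    links_ratio, anchor_ratio = link_ratios(block)
    return links_ratio <= max_links_ratio and anchor_ratio <= max_anchor_ratio


nav = Block("Home News Sports Contact", ["Home", "News", "Sports", "Contact"])
article = Block(
    "The method splits a page into blocks and scores each block "
    "by its links see details",
    ["details"],
)
print(is_content(nav), is_content(article))
```

Under these assumed definitions a navigation bar, which is almost entirely anchor text, scores high on both ratios, while an article paragraph with a single link scores low regardless of its length.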
Keywords
Web sites; content management; CSS; DIV; Web pages; block level links based content extraction; content anchor text ratios; content links ratios; Cascading style sheets; Data mining; HTML; Internet; Navigation; Probability distribution; block-level links; content extraction; merge block
fLanguage
English
Publisher
IEEE
Conference_Titel
Parallel Architectures, Algorithms and Programming (PAAP), 2011 Fourth International Symposium on
Conference_Location
Tianjin
Print_ISBN
978-1-4577-1808-3
Type
conf
DOI
10.1109/PAAP.2011.49
Filename
6128527