DocumentCode :
2995572
Title :
Block-Level Links Based Content Extraction
Author :
Shen, Shixing ; Zhang, Hui
Author_Institution :
Beihang Univ., Beijing, China
fYear :
2011
fDate :
9-11 Dec. 2011
Firstpage :
330
Lastpage :
333
Abstract :
We present block-level links based content extraction (BLCE), a method for extracting content from web pages using the link attributes of blocks: the number of links and the length of the link text (anchor text). We describe how to divide a web page into blocks and how to merge similar blocks into one, then compute the number of links and the total length of anchor text. We find that extracting content using only the raw number of links and length of anchor text is not effective, because both quantities are proportional to the length of the page. Link density is a good way to address this, so we use content links ratios and content anchor text ratios to describe the link attributes of the blocks. BLCE performs better than other methods, especially on newer web pages built with DIV and CSS, where traditional algorithms do not work well.
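The abstract does not give the exact definitions of the ratios, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a flat segmentation on a hypothetical set of block-level tags (BLOCK_TAGS), skips the block-merging step entirely, and uses an illustrative anchor text ratio (anchor text length divided by total text length of the block) with an arbitrary threshold. All names in it are invented for illustration.

```python
from html.parser import HTMLParser

# Assumption: these tags start and end a block. The paper's own
# segmentation and merging rules are not given in the abstract.
BLOCK_TAGS = {"div", "p", "td", "li", "ul", "table"}


class BlockLinkStats(HTMLParser):
    """Collect per-block text length, link count, and anchor text length."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._in_link = 0
        self._current = self._new_block()

    @staticmethod
    def _new_block():
        return {"text_len": 0, "links": 0, "anchor_len": 0}

    def _flush(self):
        # Keep only blocks that actually contain text or links.
        if self._current["text_len"] > 0 or self._current["links"] > 0:
            self.blocks.append(self._current)
        self._current = self._new_block()

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a":
            self._in_link += 1
            self._current["links"] += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a" and self._in_link > 0:
            self._in_link -= 1

    def handle_data(self, data):
        n = len(data.strip())
        if n == 0:
            return
        self._current["text_len"] += n
        if self._in_link > 0:
            self._current["anchor_len"] += n

    def close(self):
        super().close()
        self._flush()


def link_stats(html):
    """Return one dict per block, with an illustrative anchor text ratio."""
    parser = BlockLinkStats()
    parser.feed(html)
    parser.close()
    for block in parser.blocks:
        if block["text_len"] > 0:
            block["anchor_ratio"] = block["anchor_len"] / block["text_len"]
        else:
            # A block with links but no text is treated as pure navigation.
            block["anchor_ratio"] = 1.0
    return parser.blocks


def extract_content_blocks(html, max_anchor_ratio=0.5):
    """Keep blocks whose anchor text ratio is at most the (assumed) threshold."""
    return [b for b in link_stats(html) if b["anchor_ratio"] <= max_anchor_ratio]
```

Using a ratio rather than raw counts means a long article paragraph with a few links is kept, while a short navigation block made up almost entirely of anchor text is dropped, which is the motivation the abstract gives for link density.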
Keywords :
Web sites; content management; CSS; DIV; Web pages; block level links based content extraction; content anchor text ratios; content links ratios; Cascading style sheets; Data mining; HTML; Internet; Navigation; Probability distribution; block-level links; content extraction; merge block
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Parallel Architectures, Algorithms and Programming (PAAP), 2011 Fourth International Symposium on
Conference_Location :
Tianjin
Print_ISBN :
978-1-4577-1808-3
Type :
conf
DOI :
10.1109/PAAP.2011.49
Filename :
6128527