DocumentCode :
2450548
Title :
A Novel Method to Extract Informative Blocks from Web Pages
Author :
Li, Yuancheng ; Yang, Jie
Author_Institution :
Dept. of Comput. Sci., North China Electr. Power Univ., Beijing, China
fYear :
2009
fDate :
25-26 April 2009
Firstpage :
536
Lastpage :
539
Abstract :
This paper proposes a novel algorithm to extract the informative blocks from web pages and to filter out advertisements that have nothing to do with the subject of the page being browsed. In this paper, we use an HTML parser to construct a DOM tree and apply corresponding rules to construct a new tree (CST), which helps us separate the "primary content blocks" from the other blocks. We then use our algorithm to analyze the CST and trim off the useless blocks on it. The algorithm identifies primary content blocks by looking for the blocks that contain considerably more content than the others. Our system can extract web content laid out in either the Table format or the Div format. Finally, experiments on a set of more than a thousand web pages from 5 different sites show that the method is practical and achieves high accuracy.
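The pipeline in the abstract (parse the HTML into a tree, discard uninformative blocks, keep the block holding the most content) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the abstract does not give the CST construction rules, so the sketch substitutes a simple stand-in heuristic (attribute text to the nearest div, table, or td container, drop containers dominated by link text, keep the one with the most plain text). The names `Node`, `TreeBuilder`, `extract_primary_block`, and the `max_link_ratio` threshold are all hypothetical.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "table", "td"}      # assumption: containers treated as "blocks"
SKIP_TAGS = {"script", "style"}          # text inside these is never page content
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # tags with no end tag


class Node:
    """One element of the simplified tree (a stand-in, not the paper's CST)."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = []        # plain-text fragments owned by this block
        self.link_chars = 0   # characters of anchor text owned by this block


class TreeBuilder(HTMLParser):
    """Builds the tree and assigns each text run to its nearest block ancestor."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        node, in_link = self.current, False
        while node is not self.root and node.tag not in BLOCK_TAGS:
            if node.tag in SKIP_TAGS:
                return
            in_link = in_link or node.tag == "a"
            node = node.parent
        if in_link:
            node.link_chars += len(text)
        else:
            node.text.append(text)


def extract_primary_block(html, max_link_ratio=0.5):
    """Return the block with the most plain text, skipping link-heavy blocks."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    best_chars, best_node = 0, None
    stack = [builder.root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        if node.tag not in BLOCK_TAGS:
            continue
        text_chars = sum(len(t) for t in node.text)
        total = text_chars + node.link_chars
        # Trim blocks that are empty or mostly anchor text (navigation, ads).
        if total == 0 or node.link_chars / total > max_link_ratio:
            continue
        if text_chars > best_chars:
            best_chars, best_node = text_chars, node
    return best_node
```

Calling `extract_primary_block(page_html)` returns the selected node, or `None` when no block qualifies; `" ".join(node.text)` then gives its text. The link-ratio cutoff is an illustrative choice for discarding navigation bars and advertisement blocks, and a real system would need the rule set the paper describes.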
Keywords :
Web sites; grammars; hypermedia markup languages; information filtering; DOM tree; HTML parser; Web pages; browsing; informative blocks extraction; Algorithm design and analysis; Artificial intelligence; Computer science; Data mining; Electronic mail; HTML; Information filtering; Information filters; Information systems; Web pages; CST; DOM Tree; Information System applications;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Artificial Intelligence, 2009. JCAI '09. International Joint Conference on
Conference_Location :
Hainan Island
Print_ISBN :
978-0-7695-3615-6
Type :
conf
DOI :
10.1109/JCAI.2009.156
Filename :
5159060