A Novel Method to Extract Informative Blocks from Web Pages

Author

Li, Yuancheng ; Yang, Jie

Author_Institution

Dept. of Comput. Sci., North China Electr. Power Univ., Beijing, China

fYear

2009

fDate

25-26 April 2009

Firstpage

536

Lastpage

539

Abstract

This paper proposes a novel algorithm that extracts the informative blocks from web pages and filters out advertisements that have nothing to do with the subject of the page a user is browsing. We use an HTML parser to construct the DOM tree and apply a set of rules to build a new tree (CST) from it, which makes it easy to separate the "primary content blocks" from the other blocks. We then analyze the CST with our algorithm and trim off the useless blocks it contains. The algorithm identifies primary content blocks by looking for the blocks that hold substantially more content than the others. Our system handles web content laid out in either Table format or Div format well. Experiments on thousands of web pages from 5 different sites show that the method is practical and achieves high accuracy.
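
The pipeline in the abstract (parse the HTML, build a simplified block tree, keep the block with the most content) can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not give the CST construction rules, so the block tree below, the choice of `div`, `table`, and `td` as block tags, and the "longest own text wins" scoring are all assumptions made for illustration. The names `Block`, `BlockTreeBuilder`, and `extract_primary_block` are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: these tags delimit candidate blocks (Table and Div layouts).
BLOCK_TAGS = {"div", "table", "td"}
# Text inside these tags is never page content.
SKIP_TAGS = {"script", "style"}


class Block:
    """A node of the simplified block tree (a stand-in for the paper's CST)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text_parts = []  # text directly inside this block, not in nested blocks

    @property
    def own_text(self):
        return " ".join(self.text_parts)


class BlockTreeBuilder(HTMLParser):
    """Builds the block tree while the standard-library parser walks the HTML."""

    def __init__(self):
        super().__init__()
        self.root = Block("root")
        self.current = self.root
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            child = Block(tag, self.current)
            self.current.children.append(child)
            self.current = child

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS and self.current.tag == tag:
            # Mismatched end tags are simply ignored in this sketch.
            self.current = self.current.parent

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.current.text_parts.append(text)


def iter_blocks(block):
    """Yield the block and all of its descendants, depth first."""
    yield block
    for child in block.children:
        yield from iter_blocks(child)


def extract_primary_block(html):
    """Return the text of the block with the most own text (assumed heuristic)."""
    builder = BlockTreeBuilder()
    builder.feed(html)
    builder.close()
    candidates = [b for b in iter_blocks(builder.root) if b.text_parts]
    if not candidates:
        return ""
    return max(candidates, key=lambda b: len(b.own_text)).own_text


SAMPLE = """
<html><body>
<div id="nav"><a href="/">Home</a> <a href="/news">News</a></div>
<table><tr>
<td>Buy now!</td>
<td>The main article text is much longer than the surrounding navigation and advertisement blocks.</td>
</tr></table>
<div id="footer">Copyright</div>
</body></html>
"""

if __name__ == "__main__":
    print(extract_primary_block(SAMPLE))
```

On the sample page, the navigation, advertisement, and footer blocks each carry only a few characters of text, so the long table cell is selected. A real system would need a more robust score, for example one that penalizes link-heavy blocks, which the abstract does not specify.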

Keywords

Web sites; grammars; hypermedia markup languages; information filtering; DOM tree; HTML parser; Web pages; browsing; informative blocks extraction; Algorithm design and analysis; Artificial intelligence; Computer science; Data mining; Electronic mail; HTML; Information filters; Information systems; CST; Information System applications

fLanguage

English

Publisher

ieee

Conference_Titel

Artificial Intelligence, 2009. JCAI '09. International Joint Conference on

Conference_Location

Hainan Island

Print_ISBN

978-0-7695-3615-6

Type

conf

DOI

10.1109/JCAI.2009.156

Filename

5159060