• DocumentCode
    2450548
  • Title

    A Novel Method to Extract Informative Blocks from Web Pages

  • Author

    Li, Yuancheng ; Yang, Jie

  • Author_Institution
    Dept. of Comput. Sci., North China Electr. Power Univ., Beijing, China
  • fYear
    2009
  • fDate
    25-26 April 2009
  • Firstpage
    536
  • Lastpage
    539
  • Abstract
    This paper proposes a novel algorithm that extracts the informative blocks from web pages and filters out advertisements that have nothing to do with the subject of the page being browsed. We use an HTML parser to construct a DOM tree and apply corresponding rules to build a new tree (CST), which makes it easy to separate the "primary content blocks" from the other blocks. Our algorithm then analyzes the CST and trims off the useless blocks on it. The algorithm identifies primary content blocks by looking for the blocks that hold substantially more content than the others. The system works well on web content laid out in either the Table format or the Div format. Experiments on a set of thousands of web pages from 5 different sites show that the method is practical and can achieve high accuracy.
  • Keywords
    Web sites; grammars; hypermedia markup languages; information filtering; DOM tree; HTML parser; Web pages; browsing; informative blocks extraction; Algorithm design and analysis; Artificial intelligence; Computer science; Data mining; Electronic mail; HTML; Information filters; Information systems; CST; Information System applications
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Artificial Intelligence, 2009. JCAI '09. International Joint Conference on
  • Conference_Location
    Hainan Island
  • Print_ISBN
    978-0-7695-3615-6
  • Type

    conf

  • DOI
    10.1109/JCAI.2009.156
  • Filename
    5159060