• DocumentCode
    128292
  • Title
    A novel approach for content extraction from web pages
  • Author
    Bhardwaj, Arpit; Mangat, Veenu
  • Author_Institution
    UIET, Panjab Univ., Chandigarh, India
  • fYear
    2014
  • fDate
    6-8 March 2014
  • Firstpage
    1
  • Lastpage
    4
  • Abstract
    The rapid development of the Internet and web publishing techniques has created numerous information sources published as HTML pages on the World Wide Web. However, web pages also contain a lot of redundant and irrelevant information. Navigation panels, tables of contents (TOC), advertisements, copyright statements, service catalogs, privacy policies, etc. on web pages are considered redundant and irrelevant content. Such information complicates web mining tasks such as web page crawling, web page classification, link-based ranking, and topic distillation. This paper discusses various approaches for extracting informative content from web pages and presents a new approach for content extraction using word-to-leaf ratio and link density.
  • Keywords
    Web sites; content management; hypermedia markup languages; information retrieval; HTML pages; Internet; TOC; Web mining tasks; Web page classification; Web page crawling; Web publishing techniques; World Wide Web; advertisements; copyright statements; information sources; informative content extraction; link based ranking; links density; navigation panels; privacy policies; service catalogs; table of content; topic distillation complex; word to leaf ratio; Clustering algorithms; Data mining; Entropy; Feature extraction; HTML; Navigation; Web pages; Content Structure Tree; Content extraction; Document object Model; Entropy; Vision Based Page Segmentation; anchor text; clustering; hub and authority; ontology generation; template; web page segmentation;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Engineering and Computational Sciences (RAECS), 2014 Recent Advances in
  • Conference_Location
    Chandigarh
  • Print_ISBN
    978-1-4799-2290-1
  • Type
    conf
  • DOI
    10.1109/RAECS.2014.6799616
  • Filename
    6799616