A novel approach for content extraction from web pages

Author

Bhardwaj, Arpit ; Mangat, Veenu

Author_Institution

UIET, Panjab Univ., Chandigarh, India

fYear

2014

fDate

6-8 March 2014

Firstpage

1

Lastpage

4

Abstract

The rapid development of the Internet and web publishing techniques has created numerous information sources published as HTML pages on the World Wide Web. However, web pages also contain a lot of redundant and irrelevant information. Navigation panels, tables of contents (TOC), advertisements, copyright statements, service catalogs, privacy policies, etc. on web pages are considered irrelevant content. Such information complicates various web mining tasks such as web page crawling, web page classification, link-based ranking, and topic distillation. This paper discusses various approaches for extracting informative content from web pages and proposes a new approach for content extraction using word to leaf ratio and density of links.

Keywords

Web sites; content management; hypermedia markup languages; information retrieval; HTML pages; Internet; TOC; Web mining tasks; Web page classification; Web page crawling; Web publishing techniques; World Wide Web; advertisements; copyright statements; information sources; informative content extraction; link based ranking; links density; navigation panels; privacy policies; service catalogs; table of content; topic distillation; word to leaf ratio; Clustering algorithms; Data mining; Entropy; Feature extraction; HTML; Navigation; Web pages; Content Structure Tree; Content extraction; Document Object Model; Vision Based Page Segmentation; anchor text; clustering; hub and authority; ontology generation; template; web page segmentation;

fLanguage

English

Publisher

IEEE

Conference_Titel

2014 Recent Advances in Engineering and Computational Sciences (RAECS)

Conference_Location

Chandigarh

Print_ISBN

978-1-4799-2290-1

Type

conf

DOI

10.1109/RAECS.2014.6799616

Filename

6799616