DocumentCode
128292
Title
A novel approach for content extraction from web pages
Author
Bhardwaj, Arpit ; Mangat, Veenu
Author_Institution
UIET, Panjab Univ., Chandigarh, India
fYear
2014
fDate
6-8 March 2014
Firstpage
1
Lastpage
4
Abstract
The rapid development of the Internet and web publishing techniques has created numerous information sources published as HTML pages on the World Wide Web. However, web pages also contain a lot of redundant and irrelevant information. Navigation panels, tables of contents (TOC), advertisements, copyright statements, service catalogs, privacy policies, etc. on web pages are considered redundant and irrelevant content. Such information complicates web mining tasks such as web page crawling, web page classification, link-based ranking, and topic distillation. This paper discusses various approaches for extracting informative content from web pages and presents a new approach for content extraction from web pages using word to leaf ratio and density of links.
Keywords
Web sites; content management; hypermedia markup languages; information retrieval; HTML pages; Internet; TOC; Web mining tasks; Web page classification; Web page crawling; Web publishing techniques; World Wide Web; advertisements; copyright statements; information sources; informative content extraction; link based ranking; links density; navigation panels; privacy policies; service catalogs; table of content; topic distillation complex; word to leaf ratio; Clustering algorithms; Data mining; Entropy; Feature extraction; HTML; Navigation; Web pages; Content Structure Tree; Content extraction; Document Object Model; Vision Based Page Segmentation; anchor text; clustering; hub and authority; ontology generation; template; web page segmentation
fLanguage
English
Publisher
IEEE
Conference_Titel
Engineering and Computational Sciences (RAECS), 2014 Recent Advances in
Conference_Location
Chandigarh
Print_ISBN
978-1-4799-2290-1
Type
conf
DOI
10.1109/RAECS.2014.6799616
Filename
6799616