DocumentCode :
128292
Title :
A novel approach for content extraction from web pages
Author :
Bhardwaj, Arpit ; Mangat, Veenu
Author_Institution :
UIET, Panjab Univ., Chandigarh, India
fYear :
2014
fDate :
6-8 March 2014
Firstpage :
1
Lastpage :
4
Abstract :
The rapid development of the Internet and web publishing techniques has created numerous information sources published as HTML pages on the World Wide Web. However, web pages also contain a lot of redundant and irrelevant information. Navigation panels, tables of contents (TOC), advertisements, copyright statements, service catalogs, privacy policies, etc. on web pages are considered redundant and irrelevant content. Such information complicates various web mining tasks such as web page crawling, web page classification, link-based ranking, and topic distillation. This paper discusses various approaches for extracting informative content from web pages and presents a new approach for content extraction that uses the word-to-leaf ratio and the density of links.
Keywords :
Web sites; content management; hypermedia markup languages; information retrieval; HTML pages; Internet; TOC; Web mining tasks; Web page classification; Web page crawling; Web publishing techniques; World Wide Web; advertisements; copyright statements; information sources; informative content extraction; link based ranking; links density; navigation panels; privacy policies; service catalogs; table of content; topic distillation complex; word to leaf ratio; Clustering algorithms; Data mining; Entropy; Feature extraction; HTML; Navigation; Web pages; Content Structure Tree; Content extraction; Document object Model; Entropy; Vision Based Page Segmentation; anchor text; clustering; hub and authority; ontology generation; template; web page segmentation;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
2014 Recent Advances in Engineering and Computational Sciences (RAECS)
Conference_Location :
Chandigarh
Print_ISBN :
978-1-4799-2290-1
Type :
conf
DOI :
10.1109/RAECS.2014.6799616
Filename :
6799616