DocumentCode
2893884
Title
A Comprehensive Survey on Web Content Extraction Algorithms and Techniques
Author
Al-Ghuribi, Sumaia Mohammed ; Alshomrani, Saleh
Author_Institution
Fac. of Comput. & Inf. Technol., King Abdulaziz Univ., Jeddah, Saudi Arabia
fYear
2013
fDate
24-26 June 2013
Firstpage
1
Lastpage
5
Abstract
Web content extraction is an important problem that has been studied through a variety of approaches and algorithms. It aims to extract meaningful and useful data from a Webpage, where that data is surrounded by noise such as advertisements and navigation links. Many applications benefit from the extracted content, including crawlers, indexers, document classification, and information retrieval. This survey provides a comprehensive overview of many approaches that have been constructed for extracting Webpage content. The approaches are classified into categories, and for each category some approaches are described in detail along with their weaknesses. Based on an in-depth analysis of these approaches, we draw out the fundamental factors for constructing an optimal Web content extractor.
Keywords
Web sites; content management; data mining; pattern classification; Web crawlers; Webpage content extraction algorithm; Webpage content extraction technique; Webpage data extraction; advertisement links; document classification; indexers; information retrieval; navigation links; noisy data; optimal Web content extractor; Algorithm design and analysis; Classification algorithms; Feature extraction; HTML; Visualization
fLanguage
English
Publisher
ieee
Conference_Titel
Information Science and Applications (ICISA), 2013 International Conference on
Conference_Location
Suwon
Print_ISBN
978-1-4799-0602-4
Type
conf
DOI
10.1109/ICISA.2013.6579445
Filename
6579445