A Comprehensive Survey on Web Content Extraction Algorithms and Techniques

Author

Al-Ghuribi, Sumaia Mohammed ; Alshomrani, Saleh

Author_Institution

Fac. of Comput. & Inf. Technol., King Abdulaziz Univ., Jeddah, Saudi Arabia

fYear

2013

fDate

24-26 June 2013

Firstpage

1

Lastpage

5

Abstract

Web Content Extraction is an important problem that has been studied through different approaches and algorithms. It aims to extract meaningful and useful data from a Webpage, which is typically surrounded by noisy data such as advertisements and navigation links. Many applications benefit from the extracted content, such as crawlers, indexers, document classification, and information retrieval. This survey aims to provide a comprehensive overview of the many approaches that have been constructed for extracting Webpage content. In this survey, Web Content Extraction approaches are classified into categories, and for each category some approaches are described in detail along with their weaknesses. Based on a deep analysis of these approaches, we draw out the fundamental factors for constructing an optimal Web content extractor.

Keywords

Web sites; content management; data mining; pattern classification; Web crawlers; Webpage content extraction algorithm; Webpage content extraction technique; Webpage data extraction; advertisement links; document classification; indexers; information retrieval; navigation links; noisy data; optimal Web content extractor; Algorithm design and analysis; Classification algorithms; Feature extraction; HTML; Visualization

fLanguage

English

Publisher

IEEE

Conference_Titel

Information Science and Applications (ICISA), 2013 International Conference on

Conference_Location

Suwon

Print_ISBN

978-1-4799-0602-4

Type

conf

DOI

10.1109/ICISA.2013.6579445

Filename

6579445