Content extraction from web pages based on Gaussian Smoothing

Author

Liao, Baohua ; Cheng, Bo ; Liu, Chuanchang ; Cheng, Junliang ; Tan, Gang

Author_Institution

State Key Lab. of Networking & Switching Technol., Beijing Univ. of Posts & Telecommun., Beijing, China

fYear

2010

fDate

26-28 Oct. 2010

Firstpage

42

Lastpage

47

Abstract

Web pages are a potential source for information retrieval and data mining, but most HTML documents on the Internet are cluttered with large amounts of less informative and typically unrelated material. Content extraction is defined as the process of identifying the main content region and removing the other material. Based on the differing properties of Tag and Text nodes, we propose a general, accurate, and efficient content extraction framework named Gaussian Smoothing Content Extractor (GSCE) to solve this problem. In addition, building on the identification of the main content, we also describe the extraction of the Title and Published Date. According to an evaluation on a large data set, GSCE achieves high precision and recall for most Web pages.
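
The abstract does not spell out how GSCE applies smoothing, so the following is only a minimal illustrative sketch of the general idea, not the authors' method. It assumes a hypothetical per-line score (the count of visible characters left after stripping tags), smooths that sequence with a 1-D Gaussian kernel, and takes the longest run of lines above a threshold as the main content. The function names, the kernel radius and sigma, and the threshold are all assumptions made for illustration.

```python
import math
import re


def gaussian_kernel(radius, sigma):
    """Return a normalized 1-D Gaussian kernel of length 2 * radius + 1."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(values, radius=2, sigma=1.0):
    """Convolve values with a Gaussian kernel; positions outside the list count as zero."""
    kernel = gaussian_kernel(radius, sigma)
    out = []
    for i in range(len(values)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - radius
            if 0 <= j < len(values):
                acc += w * values[j]
        out.append(acc)
    return out


def text_lengths(html_lines):
    """Hypothetical per-line score: visible characters left after removing tags."""
    return [len(re.sub(r"<[^>]*>", "", line).strip()) for line in html_lines]


def main_content_span(html_lines, threshold=30.0):
    """Return (start, end) of the longest run of lines whose smoothed score
    exceeds the threshold, with end exclusive, or None if no line qualifies."""
    scores = smooth(text_lengths(html_lines))
    best = None
    start = None
    # The trailing 0.0 is a sentinel that closes a run reaching the last line.
    for i, score in enumerate(scores + [0.0]):
        if score > threshold and start is None:
            start = i
        elif score <= threshold and start is not None:
            if best is None or (i - start) > (best[1] - best[0]):
                best = (start, i)
            start = None
    return best
```

The point of smoothing in this sketch is that a short line sitting between two long paragraphs inherits a high score from its neighbours and stays inside the extracted region, while isolated short lines such as navigation links and footers remain below the threshold.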

Keywords

Gaussian processes; Internet; content-based retrieval; hypermedia markup languages; smoothing methods; Gaussian smoothing content extractor; HTML documents; Web pages; content extraction; data mining; information retrieval; published date extraction; title extraction; HTML; Head; Tutorials; DOM; Gaussian Smoothing

fLanguage

English

Publisher

IEEE

Conference_Titel

Broadband Network and Multimedia Technology (IC-BNMT), 2010 3rd IEEE International Conference on

Conference_Location

Beijing

Print_ISBN

978-1-4244-6769-3

Type

conf

DOI

10.1109/ICBNMT.2010.5704866

Filename

5704866