DocumentCode
2392481
Title
Content extraction from web pages based on Gaussian Smoothing
Author
Liao, Baohua ; Cheng, Bo ; Liu, Chuanchang ; Cheng, Junliang ; Tan, Gang
Author_Institution
State Key Lab. of Networking & Switching Technol., Beijing Univ. of Posts & Telecommun., Beijing, China
fYear
2010
fDate
26-28 Oct. 2010
Firstpage
42
Lastpage
47
Abstract
Web pages are a potential source for information retrieval and data mining, but most HTML documents on the Internet are cluttered with a large amount of less informative and typically unrelated material. Content extraction is the process of identifying the main content region and removing the other material. Exploiting the different properties of Tag and Text nodes, we propose a general, accurate and efficient content extraction framework named Gaussian Smoothing Content Extractor (GSCE) to solve this problem. In addition, building on the identification of the main content, we also describe the extraction of the Title and Published Date. In an evaluation on a large data set, GSCE achieves high precision and recall for most Web pages.
Keywords
Gaussian processes; Internet; content-based retrieval; hypermedia markup languages; smoothing methods; Gaussian smoothing content extractor; HTML documents; Internet; Web pages; content extraction; data mining; information retrieval; published date extraction; title extraction; HTML; Head; Tutorials; DOM; Gaussian Smoothing; content extraction; information retrieval;
fLanguage
English
Publisher
IEEE
Conference_Titel
Broadband Network and Multimedia Technology (IC-BNMT), 2010 3rd IEEE International Conference on
Conference_Location
Beijing
Print_ISBN
978-1-4244-6769-3
Type
conf
DOI
10.1109/ICBNMT.2010.5704866
Filename
5704866