DocumentCode :
2392481
Title :
Content extraction from web pages based on Gaussian Smoothing
Author :
Liao, Baohua ; Cheng, Bo ; Liu, Chuanchang ; Cheng, Junliang ; Tan, Gang
Author_Institution :
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China
fYear :
2010
fDate :
26-28 Oct. 2010
Firstpage :
42
Lastpage :
47
Abstract :
Web pages are a potential source for information retrieval and data mining, but most HTML documents on the Internet are cluttered with large amounts of less informative and typically unrelated material. Content extraction is defined as the process of identifying the main content region and removing other material. Based on the differing properties of Tag and Text nodes, we propose a general, accurate and efficient content extraction framework named Gaussian Smoothing Content Extractor (GSCE) to solve this problem. In addition, building on the identification of the main content, we also describe the extraction of the Title and Published Date. According to evaluation results on a large data set, GSCE achieves high precision and recall for most Web pages.
Keywords :
Gaussian processes; Internet; content-based retrieval; hypermedia markup languages; smoothing methods; Gaussian smoothing content extractor; HTML documents; Web pages; content extraction; data mining; information retrieval; published date extraction; title extraction; HTML; Head; Tutorials; DOM; Gaussian Smoothing
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Broadband Network and Multimedia Technology (IC-BNMT), 2010 3rd IEEE International Conference on
Conference_Location :
Beijing
Print_ISBN :
978-1-4244-6769-3
Type :
conf
DOI :
10.1109/ICBNMT.2010.5704866
Filename :
5704866