Title :
Primary content extraction with Mountain Model
Author :
Bing, Lidong ; Wang, Yexin ; Zhang, Yan ; Wang, Hui
Author_Institution :
Key Lab. of Machine Perception, Peking Univ., Beijing
Abstract :
Cluttered information in Web pages, such as navigation bars, related readings, and copyright notices, needs to be eliminated because it places an additional burden on search engines. In this paper, a Web page is treated as a sequence of content cells, and each cell is assigned a score according to our Mountain Model. Primary content cells are distinguished from cluttered content cells by features possessed only by primary cells. A universal classifier is trained on these features for general use across sites. To improve precision, we also provide a site-oriented classifier. Based on the Mountain Model, we devise an algorithm for primary content extraction. Experimental results show that our model is both accurate and time-efficient compared with existing models.
Keywords :
Web sites; classification; search engines; Web pages; cluttered content cells; cluttered information; content extraction; copyright notices; mountain model; navigation bars; site-oriented classifier; Bars; Buildings; Data mining; Filters; Humans; Laboratories; Navigation; Publishing
Conference_Title :
Computer and Information Technology, 2008. CIT 2008. 8th IEEE International Conference on
Conference_Location :
Sydney, NSW
Print_ISBN :
978-1-4244-2357-6
Electronic_ISBN :
978-1-4244-2358-3
DOI :
10.1109/CIT.2008.4594722