DocumentCode :
3424282
Title :
Algorithm of web page purification based on improved DOM and statistical learning
Author :
Zhang, Yong ; Deng, Ke
Author_Institution :
Coll. of Comput. & Commun., LanZhou Univ. of Technol., Lanzhou, China
Volume :
5
fYear :
2010
fDate :
25-27 June 2010
Abstract :
To effectively remove noisy information from web pages, such as advertisements and unrelated links, and to improve classification results, we propose a web page purification algorithm based on an improved DOM tree and statistical learning. We first establish a block tree model by combining the DOM tree with the visual characteristics of web content; statistical learning methods are then used to discriminate each sub-block tree and identify the main content of theme-based web pages. Experiments show that the method has a good purifying effect on all kinds of theme-based web pages. The method can be applied in the preprocessing stage of web page classification, which will enhance the accuracy of classification.
Keywords :
Web sites; content management; learning (artificial intelligence); pattern classification; tree data structures; DOM tree; Web content; Web page classification; Web page purification; block tree model; noisy information; statistical learning method; theme based Web page; Algorithm design and analysis; Classification tree analysis; Data mining; Educational institutions; Electronic mail; Purification; Search engines; Statistical learning; Tree data structures; Web pages; DOM tree; content block; statistical learning; web page purification;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Computer Design and Applications (ICCDA), 2010 International Conference on
Conference_Location :
Qinhuangdao
Print_ISBN :
978-1-4244-7164-5
Electronic_ISBN :
978-1-4244-7164-5
Type :
conf
DOI :
10.1109/ICCDA.2010.5541132
Filename :
5541132