Title :
The Noise Reduction Method of Web Pages Based on Image Features
Author :
Yao, Haitao ; Yin, Zhiyi ; Zhu, Fuxi ; Gong, Changsheng
Author_Institution :
Sch. of Comput., Wuhan Univ., Wuhan, China
Abstract :
Web pages in the same layer share similar presentation styles and noise blocks. Removing noise blocks from Web pages is the first step of data mining. Unlike traditional similarity measures based on DOM trees, this paper proposes a noise removal method based on image features. In this method, Web pages are processed as images, so any image feature can be flexibly used as a criterion for measuring the similarity of noise blocks. Once similarity has been measured, noise blocks can be distinguished from information blocks and the noise removed. Experimental results demonstrate that the method is accurate and reliable and that it supports joint measurement of multiple image features.
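The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which image features, similarity measure, or threshold are used, so the gray-level histogram feature, the histogram-intersection score, the 0.9 threshold, and the names `gray_histogram`, `similarity`, and `label_blocks` are all assumptions made for illustration. Each block is assumed to be already rendered and cropped into a small grid of grayscale pixel values.

```python
from typing import Dict, List, Sequence

# A rendered page block as rows of grayscale pixel values in 0..255 (assumed input format).
Block = Sequence[Sequence[int]]


def gray_histogram(block: Block, bins: int = 8) -> List[float]:
    """Normalized gray-level histogram, used here as one example image feature."""
    counts = [0] * bins
    total = 0
    for row in block:
        for px in row:
            counts[min(px * bins // 256, bins - 1)] += 1
            total += 1
    return [c / total for c in counts] if total else [0.0] * bins


def similarity(a: Block, b: Block, bins: int = 8) -> float:
    """Histogram intersection: 1.0 for identical histograms, 0.0 for disjoint ones."""
    ha, hb = gray_histogram(a, bins), gray_histogram(b, bins)
    return sum(min(x, y) for x, y in zip(ha, hb))


def label_blocks(
    page: Dict[str, Block],
    sibling_pages: Sequence[Dict[str, Block]],
    threshold: float = 0.9,  # assumed cut-off, not taken from the paper
) -> Dict[str, str]:
    """Label a block as noise if a near-identical block appears on a same-layer page."""
    labels = {}
    for name, block in page.items():
        repeated = any(
            similarity(block, other) >= threshold
            for sibling in sibling_pages
            for other in sibling.values()
        )
        labels[name] = "noise" if repeated else "information"
    return labels
```

Under these assumptions, a navigation bar that looks the same on two same-layer pages would be labeled as noise, while article bodies with different pixel distributions would be kept as information. Further features (for example, color or edge statistics) could be combined by averaging their similarity scores, which is one possible reading of "joint measurement of multiple image features".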
Keywords :
Internet; data mining; document image processing; Web pages; image feature; information block; noise block; noise reduction method; noise removal method; Cleaning; HTML; Information analysis; Navigation; Noise measurement; Noise reduction; XML
Conference_Title :
Computational Intelligence and Software Engineering, 2009. CiSE 2009. International Conference on
Conference_Location :
Wuhan
Print_ISBN :
978-1-4244-4507-3
Electronic_ISBN :
978-1-4244-4507-3
DOI :
10.1109/CISE.2009.5366410