Title :
Noise Reduction of Web Pages via Feature Analysis
Author :
Kun Jiang ; Yuexiang Yang
Author_Institution :
Coll. of Comput., Nat. Univ. of Defense Technol., Changsha, China
Abstract :
Noise information seriously affects studies that use web pages as datasets. Because noise removal is a fundamental task in information retrieval, removing noise from web pages quickly and accurately has received wide attention. To address the low efficiency of traditional noise reduction algorithms, this paper proposes a noise reduction algorithm that uses the DOM (Document Object Model) to preserve the original structure of web pages. With this method, noise can be located rapidly by combining several analyzed features, e.g., Link Density and Punctuation Density. The approach is evaluated on a group of web pages selected randomly from several well-known websites. The experiments show satisfactory results.
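As an illustration of the two features named in the abstract, the following is a minimal sketch and not the authors' algorithm. It assumes that Link Density is the share of a block's text characters that sit inside anchor tags and that Punctuation Density is the share of characters that are punctuation marks. The block tag set, the punctuation set, the thresholds, and the names `BlockFeatureParser` and `score_blocks` are hypothetical choices made for this example.

```python
from html.parser import HTMLParser

# Assumed set of block-level tags to score; not taken from the paper.
BLOCK_TAGS = {"div", "p", "td", "li", "section", "article"}
# Assumed punctuation set (ASCII plus common full-width marks).
PUNCTUATION = set(",.;:!?") | set("，。；：！？")
# Hypothetical thresholds, chosen only for illustration.
MAX_LINK_DENSITY = 0.5
MIN_PUNCT_DENSITY = 0.01


class BlockFeatureParser(HTMLParser):
    """Collects per-block character counts while walking the markup."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished blocks, in closing order
        self._stack = []       # currently open block elements
        self._link_depth = 0   # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._stack.append({"tag": tag, "text": 0, "link": 0, "punct": 0})
        elif tag == "a":
            self._link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a":
            if self._link_depth > 0:
                self._link_depth -= 1
        elif tag in BLOCK_TAGS and self._stack and self._stack[-1]["tag"] == tag:
            self.blocks.append(self._stack.pop())

    def handle_data(self, data):
        text = data.strip()
        if not text or not self._stack:
            return
        # Text is attributed to the innermost open block only.
        block = self._stack[-1]
        block["text"] += len(text)
        if self._link_depth > 0:
            block["link"] += len(text)
        block["punct"] += sum(1 for ch in text if ch in PUNCTUATION)


def score_blocks(html):
    """Return (tag, link_density, punct_density, is_noise) per non-empty block."""
    parser = BlockFeatureParser()
    parser.feed(html)
    parser.close()
    rows = []
    for block in parser.blocks:
        if block["text"] == 0:
            continue
        link_density = block["link"] / block["text"]
        punct_density = block["punct"] / block["text"]
        is_noise = (link_density > MAX_LINK_DENSITY
                    or punct_density < MIN_PUNCT_DENSITY)
        rows.append((block["tag"], link_density, punct_density, is_noise))
    return rows
```

Under these assumptions, a navigation bar made up entirely of links would be flagged as noise, while a paragraph of ordinary sentences would not. A fuller treatment would aggregate the features over DOM subtrees rather than only the innermost block.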
Keywords :
Internet; information filtering; DOM; Web pages; Web sites; document object model; information retrieval; link density; noise information; noise reduction algorithm; punctuation density; Accuracy; Algorithm design and analysis; HTML; Noise; Noise reduction; Web pages; DOM Tree; Feature Analysis; Information Retrieval; Noise Reduction;
Conference_Title :
Information Science and Control Engineering (ICISCE), 2015 2nd International Conference on
Conference_Location :
Shanghai
Print_ISBN :
978-1-4673-6849-0
DOI :
10.1109/ICISCE.2015.83