DocumentCode :
2787572
Title :
Noise Reduction of Web Pages via Feature Analysis
Author :
Kun Jiang ; Yuexiang Yang
Author_Institution :
Coll. of Comput., Nat. Univ. of Defense Technol., Changsha, China
fYear :
2015
fDate :
24-26 April 2015
Firstpage :
345
Lastpage :
348
Abstract :
Noise information has a serious impact on various studies that use web pages as datasets. As fundamental work in information retrieval, removing noise from web pages quickly and accurately has received wide attention. In this paper, a noise reduction algorithm that uses the DOM (Document Object Model) to preserve the original structure of web pages is proposed to address the low efficiency of traditional noise reduction algorithms. Using this method, noise information can be located rapidly by combining several analyzed features, e.g., Link Density and Punctuation Density. The approach is evaluated on a group of web pages selected randomly from several well-known websites. Experiments show satisfactory results.
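The abstract names the features but does not define them, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes link density means the share of a DOM subtree's non-whitespace characters that sit inside `<a>` elements, and punctuation density means the share that are punctuation marks. The names `Node`, `TreeBuilder`, `stats`, and `is_noise`, the punctuation set, the thresholds, and the rule for combining the two features are all hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
SKIPPED_TAGS = {"script", "style"}   # never counted as visible text
PUNCTUATION = set(",.;:!?")          # assumed punctuation set


class Node:
    """A very small DOM node: tag, parent, children, direct text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree so the page structure is preserved."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the matching open tag; ignore unmatched end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def stats(node, in_link=False):
    """Return (text_chars, link_chars, punctuation_chars) for a subtree."""
    if node.tag in SKIPPED_TAGS:
        return 0, 0, 0
    in_link = in_link or node.tag == "a"
    text = "".join(node.text.split())  # drop all whitespace
    total = len(text)
    link = total if in_link else 0
    punct = sum(ch in PUNCTUATION for ch in text)
    for child in node.children:
        t, l, p = stats(child, in_link)
        total, link, punct = total + t, link + l, punct + p
    return total, link, punct


def is_noise(node, max_link_density=0.5, min_punct_density=0.01):
    """Hypothetical rule: mostly links, or almost no punctuation."""
    total, link, punct = stats(node)
    if total == 0:
        return True
    return link / total > max_link_density or punct / total < min_punct_density


page = (
    "<div id='nav'><a href='/'>Home</a> <a href='/news'>News</a> "
    "<a href='/about'>About</a></div>"
    "<div id='main'><p>Noise removal matters, because menus and ads distort "
    "text analysis. See <a href='/paper'>the paper</a> for details.</p></div>"
)
nav, main = parse(page).children
print(is_noise(nav), is_noise(main))
```

In this toy page the navigation block consists entirely of link text, while the main block has a low link share and some sentence punctuation, so only the former is flagged. Real pages would likely need tuned thresholds and possibly additional features; a short heading without punctuation, for instance, would be misclassified by this simple rule.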
Keywords :
Internet; information filtering; DOM; Web pages; Web sites; document object model; information retrieval; link density; noise information; noise reduction algorithm; punctuation density; Accuracy; Algorithm design and analysis; HTML; Noise; Noise reduction; Web pages; DOM Tree; Feature Analysis; Information Retrieval; Noise Reduction;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Information Science and Control Engineering (ICISCE), 2015 2nd International Conference on
Conference_Location :
Shanghai
Print_ISBN :
978-1-4673-6849-0
Type :
conf
DOI :
10.1109/ICISCE.2015.83
Filename :
7120623