DocumentCode :
2780789
Title :
A novel web page duplication detection framework
Author :
Han, Zhongming ; Duan, Dagao ; Liu, Hongzhi ; Sun, Jianzhi
Author_Institution :
Sch. of Comput. Sci. & Inf. Eng., Beijing Technol. & Bus. Univ., Beijing, China
fYear :
2009
fDate :
6-8 Nov. 2009
Firstpage :
374
Lastpage :
378
Abstract :
There are many redundant Web pages on the Internet. Based on tag statistics and text similarity comparison, we present a novel multilayer framework for detecting duplicated Web pages. We propose two algorithms for detecting similar text paragraphs and implement our framework. The experimental results show that our approach achieves high performance, which means that duplicated Web pages can be detected efficiently using only tag statistics and text comparison.
Keywords :
Internet; Web sites; text analysis; Web page duplication detection; multilayer framework; redundant Web pages; similarity comparison; similarity text paragraphs detection algorithms; tag statistic; Computer science; Data mining; Fingers; HTML; Navigation; Nonhomogeneous media; Statistics; Sun; Web pages; Duplication Detection; Framework; Web Page
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Network Infrastructure and Digital Content, 2009. IC-NIDC 2009. IEEE International Conference on
Conference_Location :
Beijing
Print_ISBN :
978-1-4244-4898-2
Electronic_ISBN :
978-1-4244-4900-6
Type :
conf
DOI :
10.1109/ICNIDC.2009.5360814
Filename :
5360814