Title :
The study on Detecting Near-Duplicate WebPages
Author :
Cao, Yujuan ; Niu, Zhendong ; Wang, Weiqiang ; Zhao, Kun
Author_Institution :
Sch. of Comput. Sci. Technol., Beijing Inst. of Technol., Beijing
Abstract :
Reprinting information among websites produces a great deal of redundant WebPages. To improve search efficiency and user satisfaction, an algorithm to Detect near-Duplicate WebPages (DDW) is proposed. In the course of developing a near-duplicate detection system for a multi-billion page repository, we make two research contributions. First, we consider both syntactic and semantic information to represent documents and compute their similarities. Second, after classifying web pages into different categories, we index the features in each category and then search for near-duplicates only within the same category. From Google search results for 72 queries, we manually select 5835 near-duplicate WebPages and insert them into an existing collection of about 768,763 WebPages to form the test data. The experimental results demonstrate that our approach outperforms I-Match algorithms. In large-scale tests, approximately linear time and space complexity are obtained.
Keywords :
Web sites; classification; indexing; information retrieval; information reprinting; near-duplicate Web page detection; search efficiency; semantic information; syntactic information; user satisfaction; Aerospace control; Computer science; Data mining; Internet; Large-scale systems; Linear approximation; Plagiarism; Sampling methods; Search engines; Testing
Conference_Title :
Computer and Information Technology, 2008. CIT 2008. 8th IEEE International Conference on
Conference_Location :
Sydney, NSW
Print_ISBN :
978-1-4244-2357-6
Electronic_ISBN :
978-1-4244-2358-3
DOI :
10.1109/CIT.2008.4594656