Regional Information Center for Science and Technology - The study on Detecting Near-Duplicate WebPages

DocumentCode :

2507073

Title :

The study on Detecting Near-Duplicate WebPages

Author :

Cao, Yujuan ; Niu, Zhendong ; Wang, Weiqiang ; Zhao, Kun

Author_Institution :

Sch. of Comput. Sci. Technol., Beijing Inst. of Technol., Beijing

fYear :

2008

fDate :

8-11 July 2008

Firstpage :

Lastpage :

100

Abstract :

Reprinting information among websites produces a great deal of redundant WebPages. To improve search efficiency and user satisfaction, an algorithm to Detect near-Duplicate WebPages (DDW) is proposed. In the course of developing a near-duplicate detection system for a multi-billion page repository, we make two research contributions. First, we consider both syntactic and semantic information to represent documents and compute their similarities. Second, after classifying web pages into different categories, we index the features in each category and then search for near-duplicates only within the same category. From Google search results for 72 queries, we manually select 5835 near-duplicate WebPages and insert them into an existing collection of about 768,763 WebPages to form the test data. The experimental results demonstrate that our approach outperforms I-Match algorithms. In large-scale tests, approximately linear time and space complexity are achieved.
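
The second contribution described in the abstract (index features per category, then search for near-duplicates only within the same category) can be illustrated with a minimal sketch. This is not the authors' DDW algorithm: the abstract does not give its features or similarity measure, so word shingles and Jaccard similarity are swapped in as stand-ins, and the category labels, the `CategoryIndex` class, and the 0.5 threshold are all hypothetical.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased).

    Stand-in feature; the abstract does not specify the real features.
    """
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class CategoryIndex:
    """Hypothetical structure: one feature list per page category."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold  # assumed cut-off, not from the paper
        self.by_category = {}       # category -> [(page_id, shingle set)]

    def add(self, page_id, category, text):
        self.by_category.setdefault(category, []).append(
            (page_id, shingles(text))
        )

    def near_duplicates(self, category, text):
        """Compare the query only against pages in the same category."""
        query = shingles(text)
        return [
            page_id
            for page_id, features in self.by_category.get(category, [])
            if jaccard(query, features) >= self.threshold
        ]


index = CategoryIndex(threshold=0.5)
index.add("a", "news", "the quick brown fox jumps over the lazy dog today")
index.add("b", "sports", "the quick brown fox jumps over the lazy dog today")
index.add("c", "news", "stock markets closed higher after the central bank decision")

# Page "b" has identical text but sits in another category, so it is never compared.
matches = index.near_duplicates("news", "the quick brown fox jumps over the lazy dog")
print(matches)
```

Restricting the comparison to one category shrinks the candidate set for each query. The linear scan inside a category is a simplification for readability, not a claim about how the paper achieves its reported complexity.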

Keywords :

Web sites; classification; indexing; information retrieval; information reprinting; near-duplicate Web page detection; search efficiency; semantic information; syntactic information; user satisfaction; Aerospace control; Computer science; Data mining; Internet; Large-scale systems; Linear approximation; Plagiarism; Sampling methods; Search engines; Testing

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Computer and Information Technology, 2008. CIT 2008. 8th IEEE International Conference on

Conference_Location :

Sydney, NSW

Print_ISBN :

978-1-4244-2357-6

Electronic_ISBN :

978-1-4244-2358-3

Type :

conf

DOI :

10.1109/CIT.2008.4594656

Filename :

4594656

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2507073