Regional Information Center for Science and Technology - Where and How Duplicates Occur in the Web

DocumentCode :

2805055

Title :

Where and How Duplicates Occur in the Web

Author :

Pereira, Antonio ; Baeza-Yates, Ricardo ; Ziviani, Nivio

Author_Institution :

Dept. of Comput. Sci., Fed. Univ. of Minas Gerais

fYear :

2006

fDate :

Oct. 2006

Firstpage :

127

Lastpage :

134

Abstract :

In this paper we study duplicates on the Web, using collections that contain the documents of all sites under the .cl domain and that constitute accurate and representative subsets of the Web. We identify duplicate and near-duplicate documents in our collections and study the distribution of documents across clusters of duplicates. We also study the occurrence of duplicates in both parts of our Web graphs, the connected and the disconnected components, aiming to identify where duplicates occur more frequently. We show, for the first time, that the number of duplicates in the Web is significantly greater than the number of duplicates in the connected component of the Web graph. Previous works that estimated the number of duplicates in the Web used collections drawn from connected components of the Web. In those cases the sample of the Web was biased.

Keywords :

Internet; document handling; graph theory; Web duplicates; Web graphs; document duplicate distribution; Clustering algorithms; Computer science; Crawlers; Fingerprint recognition; Search engines; Web pages; Web search

fLanguage :

English

Publisher :

IEEE

Conference_Titel :

Web Congress, 2006. LA-Web '06. Fourth Latin American

Conference_Location :

Cholula

Print_ISBN :

0-7695-2693-4

Type :

conf

DOI :

10.1109/LA-WEB.2006.39

Filename :

4022102

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2805055