DocumentCode
2117181
Title
On the Uniform Sampling of the Web: An Improvement on Bucket Based Sampling
Author
Heidari, Sanaz ; Mousavi, Hamid ; Movaghar, Ali
Author_Institution
CE Dept., Qazvin Univ. of Tech., Tehran
fYear
2009
fDate
27-28 Feb. 2009
Firstpage
205
Lastpage
209
Abstract
The Web is one of the biggest sources of information. Its tremendous size, dynamic nature, and structure make information retrieval on the Web a challenging problem. Web search engines (WSEs) help users with this task. However, to perform effectively, such applications need current information about many characteristics of the Web. One way to determine these characteristics is statistical sampling of Web pages: instead of analyzing a very large number of Web pages, a considerably smaller and more uniform set of Web pages is used. This research analyzes existing methods for generating uniform samples of pages from the World Wide Web, focusing on a new method called bucket based sampling (BBS). We improved the uniformity of BBS samples by at least 4.45%. Using the improved BBS, we estimated the size of the public indexable Web at 27.4 billion pages. The index sizes of some commercial WSEs are also estimated and compared.
Keywords
Internet; information retrieval; search engines; Web pages; Web search engines; World Wide Web; bucket based sampling; information retrieval process; uniform sampling; Content based retrieval; Equal opportunities; Information resources; Information retrieval; Sampling methods; Search engines; Testing; Web pages; Web search; Web sites; Uniform Sampling; Web; Web Search Engine;
fLanguage
English
Publisher
IEEE
Conference_Titel
Communication Software and Networks, 2009. ICCSN '09. International Conference on
Conference_Location
Macau
Print_ISBN
978-0-7695-3522-7
Type
conf
DOI
10.1109/ICCSN.2009.164
Filename
5076840