• DocumentCode
    2117181
  • Title

    On the Uniform Sampling of the Web: An Improvement on Bucket Based Sampling

  • Author

    Heidari, Sanaz ; Mousavi, Hamid ; Movaghar, Ali

  • Author_Institution
    CE Dept., Qazvin Univ. of Tech., Tehran
  • fYear
    2009
  • fDate
    27-28 Feb. 2009
  • Firstpage
    205
  • Lastpage
    209
  • Abstract
    The Web is one of the biggest sources of information. Its tremendous size, dynamic nature, and structure make information retrieval on the Web a challenging problem. Web search engines (WSEs) have emerged to help users with this task. However, to perform effectively, these applications need current information about many characteristics of the Web. One way to determine these characteristics is statistical sampling of Web pages: instead of analyzing a large number of Web pages, a considerably smaller and more uniform set of pages is used. This research analyzes existing methods for generating uniform samples of pages from the World Wide Web, focusing specifically on a new method called bucket based sampling (BBS). In brief, we improved BBS by at least 4.45% with respect to the uniformity of the samples. Using this improved BBS, we estimated the size of the public indexable Web at 27.4 billion pages. The index sizes of some commercial WSEs are also estimated and compared.
  • Keywords
    Internet; information retrieval; search engines; Web pages; Web search engines; World Wide Web; bucket based sampling; information retrieval process; uniform sampling; Content based retrieval; Equal opportunities; Information resources; Information retrieval; Sampling methods; Search engines; Testing; Web pages; Web search; Web sites; Uniform Sampling; Web; Web Search Engine;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Communication Software and Networks, 2009. ICCSN '09. International Conference on
  • Conference_Location
    Macau
  • Print_ISBN
    978-0-7695-3522-7
  • Type

    conf

  • DOI
    10.1109/ICCSN.2009.164
  • Filename
    5076840