• DocumentCode
    3080830
  • Title

    A popularity-based URL ordering algorithm for crawlers

  • Author

    Chandramouli, Aravind ; Gauch, Susan ; Eno, Joshua

  • Author_Institution
    Univ. of Kansas, Lawrence, KS, USA
  • fYear
    2010
  • fDate
    13-15 May 2010
  • Firstpage
    556
  • Lastpage
    562
  • Abstract
    Uniform Resource Locator (URL) ordering algorithms are used by Web crawlers to determine the order in which to download pages from the Web. The current approaches for URL ordering based on link structure are expensive and/or miss many good pages, particularly in social network environments. In this paper, we present a novel URL ordering algorithm that exploits the access count information present in the Web logs on the individual Websites. In particular, we develop algorithms based on internal and external counts, and by using this popularity information for URL ordering, we are able to retrieve high quality pages earlier in the crawl. We perform our experiments on two data sets using the Web logs from university and CiteSeer Websites and, on these data sets, we achieve a statistically significant improvement in the ordering of the high quality pages (as indicated by Google's PageRank) of 57.2% and 65.7% over that of a breadth-first search crawl.
  • Keywords
    Internet; Web sites; information retrieval; Web crawlers; Web logs; popularity-based URL ordering algorithm; social network; uniform resource locator; Crawlers; Decision support systems; Helium; Information retrieval; Social network services; Uniform resource locators; page ranking; social content; url ordering; web crawler
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Human System Interactions (HSI), 2010 3rd Conference on
  • Conference_Location
    Rzeszow
  • Print_ISBN
    978-1-4244-7560-5
  • Type

    conf

  • DOI
    10.1109/HSI.2010.5514512
  • Filename
    5514512
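
The abstract describes ordering URLs by access counts taken from a site's Web logs, split into internal and external counts. The sketch below illustrates that general idea only; it is not the authors' algorithm. It assumes that an "internal" access is one whose referrer is on the same host and that anything else is "external", and it combines the two counts with a weighted sum. The names count_accesses, order_urls, w_internal, and w_external, and the log-entry shape (url, referrer), are hypothetical.

```python
import heapq
from collections import Counter
from urllib.parse import urlparse


def count_accesses(log_entries, site_host):
    """Tally per-URL access counts from (url, referrer) pairs.

    Assumption: a request is "internal" when its referrer host equals
    site_host, and "external" otherwise (including an empty referrer).
    """
    internal, external = Counter(), Counter()
    for url, referrer in log_entries:
        ref_host = urlparse(referrer).netloc if referrer else ""
        if ref_host == site_host:
            internal[url] += 1
        else:
            external[url] += 1
    return internal, external


def order_urls(urls, internal, external, w_internal=1.0, w_external=1.0):
    """Return urls sorted by descending popularity score.

    The score is a weighted sum of the two counts (an illustrative choice).
    Ties keep the original discovery order via the index in the heap key.
    """
    heap = [
        (-(w_internal * internal[u] + w_external * external[u]), i, u)
        for i, u in enumerate(urls)
    ]
    heapq.heapify(heap)
    ordered = []
    while heap:
        _, _, url = heapq.heappop(heap)
        ordered.append(url)
    return ordered


# Hypothetical log entries for a site hosted at example.edu.
log = [
    ("/a", "http://example.edu/"),
    ("/b", "http://other.com/"),
    ("/b", "http://example.edu/x"),
    ("/c", ""),
]
internal_counts, external_counts = count_accesses(log, "example.edu")
crawl_order = order_urls(["/a", "/b", "/c", "/d"], internal_counts, external_counts)
```

A crawler frontier built this way would fetch the most frequently accessed pages first, where a breadth-first crawl would fetch them in discovery order. The weights are placeholders; the record does not say how the paper combines the two counts.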