• DocumentCode
    1948237
  • Title
    Agnostic topology-based spam avoidance in large-scale web crawls
  • Author
    Sparkman, Clint ; Lee, Hsin-Tsang ; Loguinov, Dmitri
  • Author_Institution
    Texas A&M Univ., College Station, TX, USA
  • fYear
    2011
  • fDate
    10-15 April 2011
  • Firstpage
    811
  • Lastpage
    819
  • Abstract
    With the proliferation of web spam and questionable content with virtually infinite auto-generated structure, large-scale web crawlers now require low-complexity ranking methods to effectively budget their limited resources and allocate the majority of bandwidth to reputable sites. To shed light on Internet-wide spam avoidance, we study the domain-level graph from a 6.3B-page web crawl and compare several agnostic topology-based ranking algorithms on this dataset. We first propose a new methodology for comparing the various rankings and then show that in-degree BFS-based techniques decisively outperform classic PageRank-style methods. However, since BFS requires several orders of magnitude higher overhead and is generally infeasible for real-time use, we propose a fast, accurate, and scalable estimation method that can achieve much better crawl prioritization in practice, especially in applications with limited hardware resources.
  • Keywords
    Internet; security of data; unsolicited e-mail; BFS-based technique; PageRank-style method; Web crawl; Web spam; agnostic topology-based ranking algorithm; spam avoidance; Algorithm design and analysis; Crawlers; Electronic mail; Google; Manuals; Search engines
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    INFOCOM, 2011 Proceedings IEEE
  • Conference_Location
    Shanghai
  • ISSN
    0743-166X
  • Print_ISBN
    978-1-4244-9919-9
  • Type
    conf
  • DOI
    10.1109/INFCOM.2011.5935303
  • Filename
    5935303