• DocumentCode
    2290975
  • Title

    Four Heuristics to Guide Structured Content Crawling

  • Author

    Umbrich, Jürgen ; Harth, Andreas ; Hogan, Aidan ; Decker, Stefan

  • Author_Institution
    Digital Enterprise Res. Inst., Nat. Univ. of Ireland, Galway
  • fYear
    2008
  • fDate
    14-18 July 2008
  • Firstpage
    196
  • Lastpage
    202
  • Abstract
    Search engines focusing on particular media types face difficulties in discovering suitable URIs on the Web. Since the engines are only interested in a small fraction of the Web, a crawler should use heuristics to concentrate on that fraction. To devise such a heuristic, we postulate four hypotheses based on RFCs and W3C recommendations to find cues for certain content types. Tests on a corpus of 22 million files (793 GB content size) containing 630 million URIs show that for the content types text, image, and application, the recommendations are mostly being followed, while results for audio and video are much less consistent. Our findings and recommendations can be implemented as heuristics for efficient discovery of structured content on the Web on top of existing crawlers.
  • Keywords
    Internet; search engines; Web; content size; heuristics; structured content crawling; Bandwidth; Crawlers; Frequency; HTML; Human immunodeficiency virus; Information analysis; Semantic Web; Testing; Web search; identifying media types; web survey;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Web Engineering, 2008. ICWE '08. Eighth International Conference on
  • Conference_Location
    Yorktown Heights, NY
  • Print_ISBN
    978-0-7695-3261-5
  • Electronic_ISBN
    978-0-7695-3261-5
  • Type

    conf

  • DOI
    10.1109/ICWE.2008.42
  • Filename
    4577883