  • DocumentCode
    2758043
  • Title
    An Integrated Crawling Strategy for Domain-Specific Resource Discovery
  • Author
    Yuan, Richard ; Yin, Chunxia ; Liu, Jian ; Zhang, Yulian
  • Author_Institution
    Coll. of Inf. Sci. & Eng., Yanshan Univ., Qinhuangdao
  • fYear
    2007
  • fDate
    16-18 Dec. 2007
  • Firstpage
    329
  • Lastpage
    336
  • Abstract
    A topic-specific crawler selectively seeks out pages relevant to a pre-defined set of topics, rather than exploring all regions of the Web, which makes it important for domain-specific resource discovery. By restricting themselves to web pages from a specific domain, topic-specific crawlers yield good recall as well as good precision. In this paper, we present an integrated topic-specific crawling strategy. Its main features are a topic specification module, which mediates between users and search engines and identifies starting URLs by computing hub scores with the BHIST algorithm, and a URL ordering algorithm that combines features of several previous approaches. Experimental results indicate that the new crawling method performs better and fetches more topic-relevant information.
  • Keywords
    Internet; information retrieval; search engines; user interfaces; BHIST algorithm; Internet; URL ordering algorithm; domain-specific resource discovery; hub score; integrated topic-specific crawler; search engine; user interface; Crawlers; Educational institutions; Information science; Internet; Navigation; Resource management; Search engines; Uniform resource locators; Web pages; Web sites; URL ordering; resource discovery; topic-specific crawler;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Signal-Image Technologies and Internet-Based System, 2007. SITIS '07. Third International IEEE Conference on
  • Conference_Location
    Shanghai
  • Print_ISBN
    978-0-7695-3122-9
  • Type
    conf
  • DOI
    10.1109/SITIS.2007.70
  • Filename
    4618793