• DocumentCode
    3264343
  • Title
    Adaptive focused crawling based on link analysis
  • Author
    Hati, Debashis ; Sahoo, Biswajit ; Kumar, Amritesh
  • Author_Institution
    Sch. of Comput. Eng., KIIT Univ., Bhubaneswar, India
  • Volume
    4
  • fYear
    2010
  • fDate
    22-24 June 2010
  • Abstract
    A web search engine is designed to search for information on the World Wide Web (WWW). Crawlers are programs that traverse the Internet and retrieve web pages by following hyperlinks. Faced with the large number of spam websites, traditional web crawlers do not perform well. Focused crawlers use semantic web technologies to analyze the semantics of hyperlinks and web documents. The focused crawler of a special-purpose search engine aims to selectively seek out pages that are relevant to a pre-defined set of topics, rather than to explore all regions of the Web. A focused crawler is a program that searches the Internet for information related to topics of interest. The main property of focused crawling is that the crawler does not need to collect all web pages, but selects and retrieves only the relevant ones. Because the crawler is only a computer program, it cannot by itself judge how relevant a web page is. The major problem is how to retrieve the maximal set of relevant, high-quality pages. In our proposed approach, the score of an unvisited URL is calculated from four components: the relevancy of its anchor text, the similarity between its description in the Google search engine and the topic keywords, the similarity between its cohesive text and the topic keywords, and the relevancy score of its parent pages. Relevancy scores are calculated using the vector space model.
  • Keywords
    Web sites; online front-ends; search engines; semantic Web; Anchor text relevancy; Google search engine; Internet; URL score; Web crawler; Web document; Web page; Web search engine; World Wide Web; adaptive focused crawling; cohesive text similarity; focused crawler; hyperlinks; link analysis; relevancy score; semantic Web technology; similarity score; spam Web sites; special-purpose search engine; topic keywords; vector space model; Computer science education; Crawlers; Design engineering; Uniform resource locators; Web pages; Web server; crawler
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Education Technology and Computer (ICETC), 2010 2nd International Conference on
  • Conference_Location
    Shanghai
  • Print_ISBN
    978-1-4244-6367-1
  • Type
    conf
  • DOI
    10.1109/ICETC.2010.5529641
  • Filename
    5529641
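
The abstract describes scoring an unvisited URL from its anchor text, its search-engine description, its cohesive text, and the relevancy of its parent pages, with relevancy computed in a vector space model. The following is a minimal sketch of that idea, not the authors' implementation. The function names, the raw term-frequency vectors, the equal weights, and the use of the mean parent score are all assumptions made for illustration; the abstract does not specify how the components are weighted or combined, and the description is passed in as a plain string rather than fetched from any search engine.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercase the text, split on non-alphanumerics, and count term frequencies."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def url_score(topic_keywords, anchor_text, description, cohesive_text,
              parent_scores, weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of the four components named in the abstract.

    The equal weights and the mean over parent scores are assumptions,
    not values taken from the paper.
    """
    topic = term_vector(" ".join(topic_keywords))
    parts = (
        cosine_similarity(topic, term_vector(anchor_text)),
        cosine_similarity(topic, term_vector(description)),
        cosine_similarity(topic, term_vector(cohesive_text)),
        sum(parent_scores) / len(parent_scores) if parent_scores else 0.0,
    )
    return sum(w * p for w, p in zip(weights, parts))
```

A crawler frontier could then order unvisited URLs by `url_score` and fetch the highest-scoring ones first; in practice TF-IDF weighting and stop-word removal would likely replace the raw term counts used here.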