• DocumentCode
    3310248
  • Title

    An Architectural Framework of a Crawler for Retrieving Highly Relevant Web Documents by Filtering Replicated Web Collections

  • Author

    Shekhar, Shashi ; Agrawal, Rohit ; Arya, Karm Veer

  • Author_Institution
    GLA Inst. of Technol. & Manage., Mathura, India
  • fYear
    2010
  • fDate
    20-21 June 2010
  • Firstpage
    29
  • Lastpage
    33
  • Abstract
    As the Web continues to grow, it has become difficult to find relevant information using traditional search engines. Many index-based Web search engines exist for searching information in various domains on the Web, but the documents (URLs) they retrieve for a searched topic are of poor quality. Moreover, because the number of Web pages is growing rapidly, devising a personalized Web search is of great importance. This paper proposes a method to reduce the time spent browsing search results by providing a personalized Web search agent (MetaCrawler). In the proposed technique of personalized Web searching, Web pages relevant to the user's interests are ranked at the front of the result list, giving the user quick access to those links. An experiment was designed and conducted to test the performance of the proposed Web-filtering approach. The experimental results suggest substantial improvement in the crawling strategy, especially when the search strings are short.
  • Keywords
    Computer networks; Crawlers; Data mining; Information filtering; Information filters; Intelligent agent; Search engines; Uniform resource locators; Web pages; Web search; Link analysis; Search result ranking; Web IR; Web crawler; Web page classification
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Advances in Computer Engineering (ACE), 2010 International Conference on
  • Conference_Location
    Bangalore, Karnataka, India
  • Print_ISBN
    978-1-4244-7154-6
  • Type

    conf

  • DOI
    10.1109/ACE.2010.64
  • Filename
    5532879