Title :
A WEBIR Crawling Framework for Retrieving Highly Relevant Web Documents: Evaluation Based on Rank Aggregation and Result Merging Algorithms
Author :
Shekhar, Shashi ; Arya, K.V. ; Agarwal, Rohit ; Kumar, Rakesh
Author_Institution :
GLA Univ., Mathura, India
Abstract :
Finding relevant information on the web is an ongoing problem. Commercial search engines such as Google rely on sophisticated algorithms to index huge collections of web pages and make them accessible to user queries. Users, however, are still frequently overloaded with irrelevant results. The required information is often replicated and scattered across various disjoint databases. For effective web information retrieval, users need to consult several commercial search engines that work on different architectures and principles. Rank aggregation and result merging are key components of the crawling mechanism used by commercial search engines. Once the results from various search engines are collected, they need to be merged into a single unified ranked list. The effectiveness of any crawling mechanism is closely related to the rank aggregation and result merging algorithm it employs. In this paper, we investigate a variety of rank aggregation and result merging algorithms based on a wide range of available information. The effectiveness of these algorithms is then compared experimentally with our proposed crawling framework, using queries from the TREC Web track and the three most popular general-purpose search engines. Our experiments yield two important results. First, simple result merging strategies can outperform Google, Yahoo and MSN Live. Second, the proposed Content Based Result Aggregation (CBRA) algorithm outperforms other existing content based merging algorithms based on full document content.
Keywords :
Internet; document handling; query processing; search engines; Google; MSN Live; TREC Web track; WEBIR crawling framework; Web document retrieval; Web information retrieval; Web pages; Yahoo; content based merging algorithms; content based result aggregation algorithm; crawling mechanism; general-purpose search engines; rank aggregation algorithm; result merging algorithm; Computer architecture; Corporate acquisitions; Crawlers; Engines; Merging; Search engines; Web pages; Rank Aggregation; Search result ranking; Web IR; Web crawler; Web page classification;
Conference_Title :
Computational Intelligence and Communication Networks (CICN), 2011 International Conference on
Conference_Location :
Gwalior
Print_ISBN :
978-1-4577-2033-8
DOI :
10.1109/CICN.2011.17