Author :
Choudhari, Rahul ; Choudhari, R.D. ; Choudhari, Ajay
Abstract :
Notice of Violation of IEEE Publication Principles
"Increasing Search Engine Efficiency using Cooperative Web"
by Rahul Choudhari, Ajay Choudhari, R. D. Choudhari
in the Proceedings of the 2008 International Conference on Computer Science and Software Engineering (CSSE 2008), Wuhan, China, December 12, 2008
After careful and considered review of the content and authorship of this paper by a duly constituted expert committee, this paper has been found to be in violation of IEEE's Publication Principles.
This paper contains significant portions of original text from the paper cited below. The original text was copied without attribution (including appropriate references to the original author(s) and/or paper title) and without permission.
"Towards a Content-Provider-Friendly Web Page Crawler"
by Jie Xu, Qinglan Li, Huiming Qu, Alexandros Labrinidis,
in Proceedings of the 10th International Workshop on Web and Databases (WebDB 2007), Beijing, China, June 15, 2007
The performance of a search engine depends mainly on two factors. The first is the freshness of the search engine's index, which maintains Web content in the repository. The second is the quality of the ranking or matching algorithm. The former is a never-ending quest because the content of the Web keeps changing over time. Web crawlers crawl Web pages and refresh the index for the search engine. To keep search results fresh, the crawling of a Web page should be closely linked to how frequently that page is updated. However, given the size of the Web today and inherent resource constraints, re-crawling too frequently wastes bandwidth, while re-crawling too infrequently degrades the performance of the search engine. In this paper, we address the scheduling problem for Web crawlers and propose a solution, with the objective of optimizing the freshness of the repository and the quality of the index. To this end, we divide Web content providers into two groups: 1) active; 2) inactive. For inactive content providers, we use agents that continuously crawl the content providers and collect their update patterns. We also propose a scheduling scheme that capitalizes on the information supplied by the agents. Extensive experiments with real Web traces demonstrate that the scheme plays a major role in improving the content quality of the index.
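
The trade-off between re-crawling too often and too rarely can be illustrated with a minimal sketch. It is not the scheduling scheme of the paper, which the abstract does not specify. It assumes that page updates follow a Poisson process, that an agent has already estimated each page's update rate, and that the crawler may fetch a fixed number of pages per round. The names `PageStats`, `stale_probability`, and `schedule`, and the example URLs and rates, are hypothetical.

```python
import heapq
import math
from dataclasses import dataclass


@dataclass
class PageStats:
    url: str
    updates_per_day: float    # assumed: update rate estimated by an agent
    days_since_crawl: float   # time elapsed since the page was last fetched


def stale_probability(page: PageStats) -> float:
    """Probability the page changed since its last crawl (Poisson assumption)."""
    return 1.0 - math.exp(-page.updates_per_day * page.days_since_crawl)


def schedule(pages, budget):
    """Pick up to `budget` pages that are most likely to be stale."""
    return heapq.nlargest(budget, pages, key=stale_probability)


# Illustrative data only; these rates are invented for the example.
pages = [
    PageStats("news.example/front", updates_per_day=24.0, days_since_crawl=0.5),
    PageStats("blog.example/post", updates_per_day=0.1, days_since_crawl=1.0),
    PageStats("docs.example/manual", updates_per_day=0.01, days_since_crawl=30.0),
]

for page in schedule(pages, budget=2):
    print(page.url, round(stale_probability(page), 3))
```

With a budget of two, the rarely updated manual is chosen ahead of the blog post because it has gone much longer without a crawl. Under this model the schedule depends on both the update rate and the time elapsed since the last crawl.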