Title :
Virtualized dynamic URL assignment web crawling model
Author :
Bhaginath, Wani Rohit ; Shingade, Sandip ; Shirole, Mahesh
Author_Institution :
Dept. of CE & IT, V.J.T.I., Mumbai, India
Abstract :
Web search engines are software systems that retrieve information from the Internet by accepting input in the form of a query and returning results as files, pages, images, or other information. These search engines rely heavily on web crawlers, which interact with millions of web pages starting from a seed URL or a list of seed URLs. However, crawlers demand a large amount of computing resources, and the efficiency of a web search engine depends on the performance of its crawling processes. Despite continuous improvement in crawling processes, there is still a need for more efficient and lower-cost crawlers. Most existing crawlers have a centralized coordinator, which introduces a single point of failure. Taking these shortfalls into consideration, this paper proposes an architecture for a distributed web crawler. The architecture addresses two issues of existing web crawlers. The first is to create a low-cost web crawler using the virtualization concept of cloud computing. The second is balanced load distribution based on dynamic assignment of URLs. The first issue is solved using multi-core machines, where each multi-core processor is divided into a number of virtual machines (VMs) that can perform different crawling tasks in parallel. The second issue is addressed using a clustering algorithm that assigns requests to the machines according to the availability of the clusters, thereby balancing the load among components according to their real-time condition. This paper discusses the distributed architecture and the details of the implementation of the proposed algorithm.
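The idea of dynamic URL assignment can be illustrated with a minimal sketch. It is not the paper's clustering algorithm, which the abstract does not specify in detail. It swaps in a plain least-loaded policy instead: each URL goes to whichever VM currently holds the fewest URLs. The function name `assign_urls`, the VM names, and the use of a URL count as the load measure are all assumptions made for illustration.

```python
import heapq


def assign_urls(urls, vm_names, initial_load=None):
    """Assign each URL to the VM with the lowest current load.

    Hypothetical helper: load is simply the number of URLs a VM holds,
    a stand-in for whatever real-time condition a crawler would measure.
    """
    initial_load = initial_load or {}
    # Min-heap of (load, vm_name); ties are broken by VM name.
    heap = [(initial_load.get(name, 0), name) for name in vm_names]
    heapq.heapify(heap)

    assignment = {name: [] for name in vm_names}
    for url in urls:
        load, name = heapq.heappop(heap)          # least-loaded VM right now
        assignment[name].append(url)
        heapq.heappush(heap, (load + 1, name))    # record its new load
    return assignment


if __name__ == "__main__":
    seeds = [f"http://example.com/page{i}" for i in range(5)]
    # vm-b starts with two URLs already queued, so vm-a is filled first.
    result = assign_urls(seeds, ["vm-a", "vm-b"], initial_load={"vm-b": 2})
    for vm, assigned in result.items():
        print(vm, len(assigned))
```

Because the heap is updated after every assignment, a VM that starts out busy receives fewer new URLs until the others catch up, which is the balancing behaviour the abstract describes in general terms.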
Keywords :
Web sites; cloud computing; information retrieval; online front-ends; search engines; Web crawling model; Web pages; Web search engine; balanced load distribution; centralized coordinator; cloud computing; clustering algorithm; distributed Web crawler; distributed architecture; dynamic assignment; low cost Web crawler; multicore processor; multicore machines; software system; virtual machine; virtualized dynamic URL assignment; Computational modeling; Crawlers; HTML; Hardware; Pipeline processing; Software; Uniform resource locators; Clustering algorithm; Crawler; Dynamic assignment; K-means clustering; Seeds; Virtualization;
Conference_Title :
Advances in Engineering and Technology Research (ICAETR), 2014 International Conference on
Conference_Location :
Unnao
DOI :
10.1109/ICAETR.2014.7012963