DocumentCode :
2263242
Title :
A cloud-based web crawler architecture
Author :
Bahrami, Mehdi ; Singhal, Mukesh ; Zixuan Zhuang
Author_Institution :
Cloud Lab., Univ. of California, Merced, Merced, CA, USA
fYear :
2015
fDate :
17-19 Feb. 2015
Firstpage :
216
Lastpage :
223
Abstract :
Web crawlers work on behalf of applications or services to find interesting and related information on the web. For example, search engines use web crawlers to index the Internet. Web crawlers face several challenges, such as the complexity of link structures and the highly intensive computation required to retrieve complex connected links. Another issue is the storage of a massive amount of indexed links or downloaded unstructured data, such as binary files, videos or images. Even as the volume of information on the Internet increases rapidly and requests may search data in a variety of formats including unstructured data, no cloud-based architecture exists in the literature for web crawlers that could effectively address both highly intensive computing and storage issues. The cloud computing paradigm provides support for elastic resources and unstructured data, and provides pay-per-use features that allow individual businesses to run their own web crawlers for crawling the Internet or a limited set of web hosts. In this paper, we propose a cloud-based web crawler architecture that uses cloud computing features and the MapReduce programming technique. The proposed web crawler allows us to crawl the web by using distributed agents, and each agent stores its own findings in a Cloud Azure Table (NoSQL database). The proposed web crawler can also store massive amounts of unstructured data in Azure Blob storage. We analyze the performance and scalability of the proposed web crawler and describe its advantages over traditional distributed web crawlers.
Keywords :
Big Data; SQL; cloud computing; information retrieval; parallel programming; search engines; storage management; Azure Blob storage; Cloud Azure Table; Internet; MapReduce programming technique; NoSQL database; Web hosts; cloud computing paradigm; cloud-based Web crawler architecture; complex connected links; distributed agents; downloaded unstructured data; elastic resources; indexed links; pay-per-use features; performance analysis; search engines; Cloud computing; Computer architecture; Crawlers; Servers; Service-oriented architecture; Uniform resource locators; big data; cloud computing; cloud-based web crawler; multimedia web crawler; web crawler;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Intelligence in Next Generation Networks (ICIN), 2015 18th International Conference on
Conference_Location :
Paris
Type :
conf
DOI :
10.1109/ICIN.2015.7073834
Filename :
7073834