A cloud-based web crawler architecture

Author

Bahrami, Mehdi; Singhal, Mukesh; Zhuang, Zixuan

Author_Institution

Cloud Lab., Univ. of California, Merced, Merced, CA, USA

fYear

2015

fDate

17-19 Feb. 2015

Firstpage

216

Lastpage

223

Abstract

Web crawlers work on behalf of applications or services to find interesting and related information on the web. For example, search engines use web crawlers to index the Internet. Web crawlers face several challenges, such as the complexity of the relationships between links and the highly intensive computation required when a crawler retrieves complex connected links. Another issue is the storage of a massive number of indexed links or of downloaded unstructured data, such as binary files, videos or images. Although the volume of information on the Internet is increasing rapidly and requests may search for data in a variety of formats, including unstructured data, no cloud-based web crawler architecture exists in the literature that effectively addresses both the intensive computing and the storage issues. The cloud computing paradigm provides support for elastic resources and unstructured data, and provides pay-per-use features that allow individual businesses to run their own web crawlers for crawling the Internet or a limited set of web hosts. In this paper, we propose a cloud-based web crawler architecture that uses cloud computing features and the MapReduce programming technique. The proposed web crawler crawls the web by using distributed agents, and each agent stores its own findings in a Cloud Azure Table (NoSQL database). The proposed web crawler can also store massive amounts of unstructured data in Azure Blob storage. We analyze the performance and scalability of the proposed web crawler and describe its advantages over traditional distributed web crawlers.
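The abstract does not give implementation details, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It shows distributed agents as a map step, link grouping as a reduce step, and two in-memory dictionaries standing in for the NoSQL table and the blob store. The names FAKE_WEB, table_store, blob_store, map_agent, reduce_links and crawl are all hypothetical, no Azure SDK calls are made, and the agents run sequentially in one process, whereas a cloud deployment would presumably run them in parallel.

```python
from collections import defaultdict
from hashlib import sha1

# Hypothetical in-memory "web": URL -> (raw content, outgoing links).
FAKE_WEB = {
    "http://a.example/": (b"<html>A</html>", ["http://b.example/", "http://c.example/"]),
    "http://b.example/": (b"<html>B</html>", ["http://c.example/"]),
    "http://c.example/": (b"\x89PNG binary data", []),
}

# Assumed stand-ins for cloud storage (plain dicts, not real Azure services).
table_store = {}  # NoSQL-style table: (partition_key, row_key) -> entity
blob_store = {}   # blob storage: blob name -> raw bytes


def map_agent(agent_id, urls):
    """Map step: one agent crawls its share of URLs and emits (link, source) pairs."""
    pairs = []
    for url in urls:
        content, links = FAKE_WEB.get(url, (b"", []))
        # Unstructured content goes to the blob stand-in under a hashed name.
        blob_name = sha1(url.encode()).hexdigest()
        blob_store[blob_name] = content
        # Each agent records its findings under its own partition key.
        table_store[(f"agent-{agent_id}", url)] = {
            "blob": blob_name,
            "out_links": len(links),
        }
        pairs.extend((link, url) for link in links)
    return pairs


def reduce_links(all_pairs):
    """Reduce step: group the emitted pairs by target URL."""
    grouped = defaultdict(set)
    for link, source in all_pairs:
        grouped[link].add(source)
    return grouped


def crawl(seeds, num_agents=2, max_rounds=3):
    """Run map/reduce rounds until the frontier is empty or max_rounds is hit."""
    seen, frontier = set(seeds), list(seeds)
    for _ in range(max_rounds):
        if not frontier:
            break
        # Split the frontier round-robin across the agents.
        partitions = [frontier[i::num_agents] for i in range(num_agents)]
        pairs = []
        for agent_id, part in enumerate(partitions):
            pairs.extend(map_agent(agent_id, part))
        grouped = reduce_links(pairs)
        # Only URLs not crawled or queued before enter the next round.
        frontier = sorted(u for u in grouped if u not in seen)
        seen.update(frontier)
    return seen


if __name__ == "__main__":
    crawled = crawl(["http://a.example/"])
    print(sorted(crawled))
    print(len(table_store), "table rows,", len(blob_store), "blobs")
```

Using the agent identifier as the partition key is one possible way to keep each agent's writes separate; the paper may partition its table differently.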

Keywords

Big Data; SQL; cloud computing; information retrieval; parallel programming; search engines; storage management; Azure Blob storage; Cloud Azure Table; Internet; MapReduce programming technique; NoSQL database; Web hosts; cloud computing paradigm; cloud-based Web crawler architecture; complex connected links; distributed agents; downloaded unstructured data; elastic resources; indexed links; pay-per-use features; performance analysis; Computer architecture; Crawlers; Servers; Service-oriented architecture; Uniform resource locators; cloud-based web crawler; multimedia web crawler; web crawler

fLanguage

English

Publisher

IEEE

Conference_Titel

Intelligence in Next Generation Networks (ICIN), 2015 18th International Conference on

Conference_Location

Paris

Type

conf

DOI

10.1109/ICIN.2015.7073834

Filename

7073834