• DocumentCode
    2263242
  • Title
    A cloud-based web crawler architecture
  • Author
    Bahrami, Mehdi; Singhal, Mukesh; Zhuang, Zixuan
  • Author_Institution
    Cloud Lab., Univ. of California, Merced, Merced, CA, USA
  • fYear
    2015
  • fDate
    17-19 Feb. 2015
  • Firstpage
    216
  • Lastpage
    223
  • Abstract
    Web crawlers work on behalf of applications or services to find interesting and related information on the web. For example, search engines use web crawlers to index the Internet. Web crawlers face several challenges, such as the complexity of link structures and the highly intensive computation required to retrieve complex connected links. Another issue is the storage of a massive amount of indexed links or downloaded unstructured data, such as binary files, videos or images. Although the volume of information on the Internet increases rapidly and requests may search data in a variety of formats, including unstructured data, no cloud-based architecture exists in the literature for web crawlers that could effectively address both the highly intensive computing and the storage issues. The cloud computing paradigm provides support for elastic resources and unstructured data, and provides pay-per-use features that allow individual businesses to run their own web crawlers for crawling the Internet or a limited set of web hosts. In this paper, we propose a cloud-based web crawler architecture that uses cloud computing features and the MapReduce programming technique. The proposed web crawler allows us to crawl the web by using distributed agents, and each agent stores its own findings in a Cloud Azure Table (NoSQL database). The proposed web crawler can also store massive amounts of unstructured data in Azure Blob storage. We analyze the performance and scalability of the proposed web crawler and describe its advantages over traditional distributed web crawlers.
  • Keywords
    Big Data; SQL; cloud computing; information retrieval; parallel programming; search engines; storage management; Azure Blob storage; Cloud Azure Table; Internet; MapReduce programming technique; NoSQL database; Web hosts; cloud computing paradigm; cloud-based Web crawler architecture; complex connected links; distributed agents; downloaded unstructured data; elastic resources; indexed links; pay-per-use features; performance analysis; Computer architecture; Crawlers; Servers; Service-oriented architecture; Uniform resource locators; cloud-based web crawler; multimedia web crawler; web crawler
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Intelligence in Next Generation Networks (ICIN), 2015 18th International Conference on
  • Conference_Location
    Paris
  • Type
    conf
  • DOI
    10.1109/ICIN.2015.7073834
  • Filename
    7073834