• DocumentCode
    86516
  • Title

    Deploying Large-Scale Datasets on-Demand in the Cloud: Treats and Tricks on Data Distribution

  • Author

    Vaquero, Luis M. ; Celorio, Antonio ; Cuadrado, Felix ; Cuevas, Ruben

  • Author_Institution
    Hewlett-Packard Labs. Security & Cloud Lab., Bristol, UK
  • Volume
    3
  • Issue
    2
  • fYear
    2015
  • fDate
    April-June 2015
  • Firstpage
    132
  • Lastpage
    144
  • Abstract
    Public clouds have democratised access to analytics for virtually any institution in the world. Virtual machines (VMs) can be provisioned on demand to crunch data once it has been uploaded into them. While this task is trivial for a few tens of VMs, it becomes increasingly complex and time-consuming when the scale grows to hundreds or thousands of VMs crunching tens or hundreds of TB. Moreover, the elapsed time comes at a price: the cost of provisioning VMs in the cloud and keeping them waiting while the data loads. In this paper we present a big data provisioning service that incorporates hierarchical and peer-to-peer data distribution techniques to speed up data loading into the VMs used for data processing. The system dynamically changes the sources from which each VM fetches its data in order to speed up loading. We tested this solution with 1000 VMs and 100 TB of data, reducing loading time by at least 30 percent over current state-of-the-art techniques. This dynamic topology mechanism is tightly coupled with classic declarative machine configuration techniques (the system takes a single high-level declarative configuration file and configures both software and data loading). Together, these two techniques simplify the deployment of big data in the cloud for end users who may not be experts in infrastructure management.
  • Keywords
    Big Data; cloud computing; peer-to-peer computing; virtual machines; VM; big data provisioning service; classic declarative machine configuration techniques; data loading; data processing; dynamic topology mechanism; high-level declarative configuration file; infrastructure management; large-scale datasets on-demand; peer-to-peer data distribution techniques; public clouds; Distributed databases; Loading; Relays; Servers; BitTorrent; Large-scale data transfer; big data distribution; flash crowd; p2p everyday; p2p overlay; provisioning
  • fLanguage
    English
  • Journal_Title
    Cloud Computing, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    2168-7161
  • Type

    jour

  • DOI
    10.1109/TCC.2014.2360376
  • Filename
    6910293