• DocumentCode
    3607809
  • Title

    Optimizing big data processing performance in the public cloud: opportunities and approaches

  • Author

    Dan Wang ; Jiangchuan Liu

  • Author_Institution
    Dept. of Comput., Hong Kong Polytech. Univ., Hong Kong, China
  • Volume
    29
  • Issue
    5
  • fYear
    2015
  • Firstpage
    31
  • Lastpage
    35
  • Abstract
    Today's lightning-fast data generation from massive sources is calling for efficient big data processing, which imposes unprecedented demands on the computing and networking infrastructures. State-of-the-art tools, most notably MapReduce, are generally run on dedicated server clusters to explore data parallelism. For grassroots users or non-computing professionals, the cost of deploying and maintaining large-scale dedicated server clusters can be prohibitively high, not to mention the technical skills involved. On the other hand, public clouds allow general users to rent virtual machines (VMs) and run their applications in a pay-as-you-go manner, with ultra-high scalability and minimal upfront costs. This new computing paradigm has gained tremendous success in recent years, becoming a highly attractive alternative to dedicated server clusters. This article discusses the critical challenges and opportunities when big data meet the public cloud. We identify the key differences between running big data processing in a public cloud and in dedicated server clusters. We then present two important problems for efficient big data processing in the public cloud: resource provisioning (i.e., how to rent VMs) and VM-MapReduce job/task scheduling (i.e., how to run MapReduce after the VMs are constructed). Each of these two questions has a set of problems to solve. We present solution approaches for certain problems, and offer optimized design guidelines for others. Finally, we discuss our implementation experiences.
  • Keywords
    Big Data; cloud computing; parallel processing; virtual machines; Big Data processing; VM-MapReduce job/task scheduling; computing infrastructures; data generation; data parallelism; large-scale dedicated server clusters; networking infrastructures; performance optimization; public cloud; resource provisioning; Big data; Cloud computing; Data processing; Runtime; Servers; Virtualization
  • fLanguage
    English
  • Journal_Title
    Network, IEEE
  • Publisher
    IEEE
  • ISSN
    0890-8044
  • Type

    jour

  • DOI
    10.1109/MNET.2015.7293302
  • Filename
    7293302