• DocumentCode
    2785270
  • Title

    Accelerating MapReduce Analytics Using CometCloud

  • Author

    AbdelBaky, Moustafa ; Kim, Hyunjoo ; Rodero, Ivan ; Parashar, Manish

  • Author_Institution
    NSF Autonomic & Cloud Comput. Center, Rutgers Univ., Piscataway, NJ, USA
  • fYear
    2012
  • fDate
    24-29 June 2012
  • Firstpage
    447
  • Lastpage
    454
  • Abstract
    MapReduce-Hadoop has emerged as an effective framework for large-scale data analytics, providing support for executing jobs and storing data in a parallel and distributed manner. MapReduce has been shown to perform very well on large datacenters running applications where the data can be effectively divided into homogeneous chunks running across homogeneous hardware. However, the performance of MapReduce-Hadoop is far from ideal when the hardware, the datasets, or both are heterogeneous. Such heterogeneity is unavoidable in many academic computing environments that use multiple generations of hardware and share resources among users. Heterogeneity is also unavoidable in scientific applications that process a varying number of datasets of different sizes. In these cases, the performance of MapReduce-Hadoop can be a concern. In this paper, we implement MapReduce on top of CometCloud to address the issue of heterogeneity and support application classes that involve irregular datasets (e.g., a large number of small data files or datasets of varying sizes). Furthermore, we develop an autonomic manager that can schedule MapReduce tasks based on user objectives, provision resources accordingly, and support on-demand scale-up and cloudbursts. These resources can be selected from a hybrid infrastructure such as local clusters, data centers, and public clouds. The performance of the developed solution is verified using a protein data mining application operating on data from the Protein Data Bank. The application is deployed, based on deadline and budget constraints, on a cluster at Rutgers and/or Amazon EC2 resources. The experimental results show that the MapReduce-CometCloud framework can effectively support applications operating on large numbers of small data files in a heterogeneous and distributed environment, and can satisfy user objectives autonomically using cloudbursts.
  • Keywords
    biology computing; cloud computing; data analysis; data mining; proteins; scheduling; storage management; Amazon EC2 resources; MapReduce analytics acceleration; MapReduce task scheduling; MapReduce-CometCloud framework; MapReduce-Hadoop; Rutgers; autonomic manager; data storage; datacenters; heterogeneity issue; job execution; large-scale data analytics; local clusters; protein data bank; protein data mining application; public clouds; Cloud computing; Hardware; Monitoring; Programming; Proteins; Runtime; Schedules;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cloud Computing (CLOUD), 2012 IEEE 5th International Conference on
  • Conference_Location
    Honolulu, HI
  • ISSN
    2159-6182
  • Print_ISBN
    978-1-4673-2892-0
  • Type

    conf

  • DOI
    10.1109/CLOUD.2012.150
  • Filename
    6253537