Accelerating MapReduce Analytics Using CometCloud

Author

AbdelBaky, Moustafa ; Kim, Hyunjoo ; Rodero, Ivan ; Parashar, Manish

Author_Institution

NSF Autonomic & Cloud Comput. Center, Rutgers Univ., Piscataway, NJ, USA

fYear

2012

fDate

24-29 June 2012

Firstpage

447

Lastpage

454

Abstract

MapReduce-Hadoop has emerged as an effective framework for large-scale data analytics, providing support for executing jobs and storing data in a parallel and distributed manner. MapReduce has been shown to perform very well in large datacenters running applications where the data can be effectively divided into homogeneous chunks processed across homogeneous hardware. However, the performance of MapReduce-Hadoop is far from ideal when the hardware, the datasets, or both are heterogeneous. Such heterogeneity is unavoidable in many academic computing environments that use multiple generations of hardware and share resources among users. Heterogeneity is also unavoidable in scientific applications that process a varying number of datasets of different sizes. In these cases, the performance of MapReduce-Hadoop can be a concern. In this paper, we implement MapReduce on top of CometCloud to address the issue of heterogeneity and support application classes that involve irregular datasets (e.g., a large number of small data files or datasets of varying sizes). Furthermore, we develop an autonomic manager that can schedule MapReduce tasks based on user objectives, provision resources accordingly, and support on-demand scale-up and cloudbursts. These resources can be selected from a hybrid infrastructure such as local clusters, data centers, and public clouds. The performance of the developed solution is verified using a protein data mining application operating on data from the Protein Data Bank. The application is deployed, based on deadline and budget constraints, on a cluster at Rutgers and/or Amazon EC2 resources. The experimental results show that the MapReduce-CometCloud framework can effectively support applications operating on large numbers of small data files in a heterogeneous and distributed environment, and can satisfy user objectives autonomically using cloudbursts.

Keywords

biology computing; cloud computing; data analysis; data mining; proteins; scheduling; storage management; Amazon EC2 resources; MapReduce analytics acceleration; MapReduce task scheduling; MapReduce-CometCloud framework; MapReduce-Hadoop; Rutgers; autonomic manager; data storage; datacenters; heterogeneity issue; job execution; large-scale data analytics; local clusters; protein data bank; protein data mining application; public clouds; Cloud computing; Hardware; Monitoring; Programming; Proteins; Runtime; Schedules;

fLanguage

English

Publisher

IEEE

Conference_Titel

Cloud Computing (CLOUD), 2012 IEEE 5th International Conference on

Conference_Location

Honolulu, HI

ISSN

2159-6182

Print_ISBN

978-1-4673-2892-0

Type

conf

DOI

10.1109/CLOUD.2012.150

Filename

6253537