A Fault-Tolerant High Performance Cloud Strategy for Scientific Computing

Author

Okorafor, Ekpe

Author_Institution

Comput. Sci. & Eng., African Univ. of Sci. & Technol. (AUST), Abuja, Nigeria

fYear

2011

fDate

16-20 May 2011

Firstpage

1525

Lastpage

1532

Abstract

Scientific computing often requires a massive number of computers to perform large-scale experiments. Traditionally, high-performance computing solutions and installed facilities such as clusters and supercomputers have been employed to address these needs. Cloud computing provides scientists with a completely new model for utilizing computing infrastructure, with the ability to perform parallel computations using large pools of virtual machines (VMs). The infrastructure services (Infrastructure-as-a-Service) provided by cloud vendors allow any user to provision a large number of compute instances. However, scientific computing is typically characterized by complex communication patterns and requires optimized runtimes. Today, VMs are manually instantiated, configured and maintained by cloud users. This, coupled with the latency, crash and omission failures of service providers, results in inefficient use of VMs, increased complexity in VM-management tasks, a reduction in overall computation power and increased task completion time. In this paper, a high-performance cloud computing strategy is proposed that combines the adaptation of a parallel processing framework, such as the Message Passing Interface (MPI), with an efficient checkpoint infrastructure for VMs, enabling its effective use for scientific computing. By developing such a mechanism, we can achieve optimized runtimes comparable to native clusters, improve checkpoints with low interference on task execution and provide efficient task recovery. In addition, checkpointing is used to minimize the cost and volatility of resource provisioning while improving overall reliability. Analysis and simulations show that the proposed approach compares favorably with native cluster MPI implementations.
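
The checkpoint-and-recover idea summarized in the abstract can be illustrated with a minimal, single-process sketch. This is not the paper's VM-level checkpoint infrastructure or its MPI integration; it is an application-level analogy under stated assumptions. The function names (save_checkpoint, load_checkpoint, run_task), the pickle file format, and the fail_at failure-injection parameter are all hypothetical and chosen only for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    # os.replace is atomic on the same filesystem, so a crash during the
    # write never leaves a half-written checkpoint at `path`.
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run_task(path, n_steps, interval, fail_at=None):
    """Sum the integers 0..n_steps-1, checkpointing every `interval` steps.

    `fail_at` is a hypothetical failure-injection knob: when the task reaches
    that step it raises, standing in for a crashed VM.
    """
    # Resume from the last checkpoint if one exists, else start fresh.
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated VM crash")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

In this sketch a rerun after a crash repeats only the work done since the last checkpoint, so a shorter interval reduces lost work at the price of more frequent checkpoint writes.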

Keywords

application program interfaces; checkpointing; cloud computing; fault tolerant computing; message passing; parallel machines; resource allocation; virtual machines; VM-management tasks; checkpoint infrastructure; cloud computing; cloud vendors; failures; fault-tolerant high performance cloud strategy; infrastructure services; interference; message passing interface; parallel computations; parallel processing; reliability; resource provisioning; scientific computing; task execution; task recovery; virtual machines; Checkpointing; Cloud computing; Computer architecture; Fault tolerance; Fault tolerant systems; Program processors; Protocols;

fLanguage

English

Publisher

IEEE

Conference_Titel

Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), 2011 IEEE International Symposium on

Conference_Location

Shanghai

ISSN

1530-2075

Print_ISBN

978-1-61284-425-1

Electronic_ISBN

1530-2075

Type

conf

DOI

10.1109/IPDPS.2011.306

Filename

6009011