DocumentCode :
3145919
Title :
A Fault-Tolerant High Performance Cloud Strategy for Scientific Computing
Author :
Okorafor, Ekpe
Author_Institution :
Comput. Sci. & Eng., African Univ. of Sci. & Technol. (AUST), Abuja, Nigeria
fYear :
2011
fDate :
16-20 May 2011
Firstpage :
1525
Lastpage :
1532
Abstract :
Scientific computing often requires a massive number of computers to perform large-scale experiments. Traditionally, high-performance computing solutions and installed facilities such as clusters and supercomputers have been employed to address these needs. Cloud computing provides scientists with a completely new model for utilizing computing infrastructure, with the ability to perform parallel computations using large pools of virtual machines (VMs). The infrastructure services (Infrastructure-as-a-Service) provided by cloud vendors allow any user to provision a large number of compute instances. However, scientific computing is typically characterized by complex communication patterns and requires optimized runtimes. Today, VMs are manually instantiated, configured and maintained by cloud users. These factors, coupled with latency, crash and omission failures in service providers, result in inefficient use of VMs, increased complexity in VM-management tasks, a reduction in overall computation power and increased task completion time. In this paper, a high-performance cloud computing strategy is proposed that combines the adaptation of a parallel processing framework, such as the Message Passing Interface (MPI), with an efficient checkpoint infrastructure for VMs, enabling the effective use of the cloud for scientific computing. By developing such a mechanism, we can achieve optimized runtimes comparable to native clusters, improve checkpoints with low interference on task execution and provide efficient task recovery. In addition, checkpointing is used to minimize the cost and volatility of resource provisioning, while improving overall reliability. Analysis and simulations show that the proposed approach compares favorably with native cluster MPI implementations.
Keywords :
application program interfaces; checkpointing; cloud computing; fault tolerant computing; message passing; parallel machines; resource allocation; virtual machines; VM-management tasks; checkpoint infrastructure; cloud computing; cloud vendors; failures; fault-tolerant high performance cloud strategy; infrastructure services; interference; message passing interface; parallel computations; parallel processing; reliability; resource provisioning; scientific computing; task execution; task recovery; virtual machines; Checkpointing; Cloud computing; Computer architecture; Fault tolerance; Fault tolerant systems; Program processors; Protocols;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), 2011 IEEE International Symposium on
Conference_Location :
Shanghai
ISSN :
1530-2075
Print_ISBN :
978-1-61284-425-1
Electronic_ISBN :
1530-2075
Type :
conf
DOI :
10.1109/IPDPS.2011.306
Filename :
6009011