• DocumentCode
    3145919
  • Title

    A Fault-Tolerant High Performance Cloud Strategy for Scientific Computing

  • Author

    Okorafor, Ekpe

  • Author_Institution
    Comput. Sci. & Eng., African Univ. of Sci. & Technol. (AUST), Abuja, Nigeria
  • fYear
    2011
  • fDate
    16-20 May 2011
  • Firstpage
    1525
  • Lastpage
    1532
  • Abstract
    Scientific computing often requires a massive number of computers to perform large-scale experiments. Traditionally, high-performance computing solutions and installed facilities such as clusters and supercomputers have been employed to address these needs. Cloud computing provides scientists with a completely new model of utilizing computing infrastructure, with the ability to perform parallel computations using large pools of virtual machines (VMs). The infrastructure services (Infrastructure-as-a-Service) provided by cloud vendors allow any user to provision a large number of compute instances. However, scientific computing is typically characterized by complex communication patterns and requires optimized runtimes. Today, VMs are manually instantiated, configured and maintained by cloud users. This, coupled with the latency, crash and omission failures in service providers, results in an inefficient use of VMs, increased complexity in VM-management tasks, a reduction in overall computation power and increased time for task completion. In this paper, a high-performance cloud computing strategy is proposed that combines the adaptation of a parallel processing framework, such as the Message Passing Interface (MPI), with an efficient checkpoint infrastructure for VMs, enabling its effective use for scientific computing. By developing such a mechanism, we can achieve optimized runtimes comparable to native clusters, improve checkpoints with low interference on task execution and provide efficient task recovery. In addition, checkpointing is used to minimize the cost and volatility of resource provisioning, while improving overall reliability. Analysis and simulations show that the proposed approach compares favorably with the native cluster MPI implementations.
  • Keywords
    application program interfaces; checkpointing; cloud computing; fault tolerant computing; message passing; parallel machines; resource allocation; virtual machines; VM-management tasks; checkpoint infrastructure; cloud vendors; failures; fault-tolerant high performance cloud strategy; infrastructure services; interference; message passing interface; parallel computations; parallel processing; reliability; resource provisioning; scientific computing; task execution; task recovery; Computer architecture; Fault tolerance; Fault tolerant systems; Program processors; Protocols
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), 2011 IEEE International Symposium on
  • Conference_Location
    Shanghai
  • ISSN
    1530-2075
  • Print_ISBN
    978-1-61284-425-1
  • Electronic_ISBN
    1530-2075
  • Type

    conf

  • DOI
    10.1109/IPDPS.2011.306
  • Filename
    6009011