• DocumentCode
    688316
  • Title

    A Failure Recovery Solution for Transplanting High-Performance Data-Intensive Algorithms from the Cluster to the Cloud

  • Author

    Da-Qi Ren ; Zane Wei

  • Author_Institution
    US R&D Center, Huawei Technol., Santa Clara, CA, USA
  • fYear
    2013
  • fDate
    13-15 Nov. 2013
  • Firstpage
    1463
  • Lastpage
    1468
  • Abstract
    The computing cloud manages huge numbers of virtualized resources to provide uniquely beneficial computing paradigms for scientific research. A modern cloud can behave in a virtual context - much like a local homogeneous computer cluster - to deliver High Performance Computing (HPC) platforms that provide public users with access, cut purchase costs, and eliminate the maintenance burden of sophisticated hardware. For decades most distributed scientific computing software has been designed to run on clusters. Research on how to transplant cluster-based programs and performance-tuning mechanisms onto the cloud platform has gathered momentum in recent years. This paper introduces a fault-tolerant approach that assures the reliability of virtual clusters on clouds where high-performance and data-intensive computing paradigms are deployed. We have solved the failure recovery issue for TCP connections containing MPI error handlers by exploiting and modeling the constraints of low-level distributed resources. The combined MPI and TCP environment can support software development for multiple parallel programming models, including asynchronous distributed computing based on MPI for scientific HPC and synchronous distributed computing for big data, such as MapReduce and Pregel. This paper sets out detailed MPI/TCP fault-tolerant mechanisms, including primitives and functions. These elements enable the systematic and hierarchical development of a globally optimized HPC on the cloud platform.
  • Keywords
    cloud computing; parallel processing; system recovery; virtualisation; HPC platforms; MPI error; MPI/TCP fault-tolerant mechanisms; TCP connections; TCP environment; asynchronous distributed computing; cloud computing; cloud platform; computer cluster; data intensive computing; distributed scientific computing software; failure recovery solution; high performance computing; multiple parallel programming models; performance tuning mechanisms; reliability virtual clusters; software development; sophisticated hardware; transplanting high performance data-intensive algorithms; virtual context; virtualized resources; Cloud computing; Computational modeling; Fault tolerance; Fault tolerant systems; Hardware; Virtual machining; Cloud Computing; Computing; Data-Intensive; Fault Tolerance; High-Performance Computing;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (HPCC_EUC), 2013 IEEE 10th International Conference on
  • Conference_Location
    Zhangjiajie
  • Type

    conf

  • DOI
    10.1109/HPCC.and.EUC.2013.207
  • Filename
    6832089