A Failure Recovery Solution for Transplanting High-Performance Data-Intensive Algorithms from the Cluster to the Cloud

Author

Da-Qi Ren ; Zane Wei

Author_Institution

US R&D Center, Huawei Technol., Santa Clara, CA, USA

fYear

2013

fDate

13-15 Nov. 2013

Firstpage

1463

Lastpage

1468

Abstract

The computing cloud manages huge numbers of virtualized resources to provide uniquely beneficial computing paradigms for scientific research. A modern cloud can behave in a virtual context, much like a local homogeneous computer cluster, to deliver High Performance Computing (HPC) platforms that give public users access, cut purchase costs, and eliminate the maintenance burden of sophisticated hardware. For decades, most distributed scientific computing software has been designed to run on clusters. Research on how to transplant cluster-based programs and performance-tuning mechanisms onto the cloud platform has gathered momentum in recent years. This paper introduces a fault-tolerant approach that assures the reliability of virtual clusters on clouds where high-performance and data-intensive computing paradigms are deployed. We have solved the failure recovery issue for TCP connections containing MPI error handlers by exploiting and modeling the constraints of low-level distributed resources. The combined MPI and TCP environment can support software development for multiple parallel programming models, including asynchronous distributed computing based on MPI for scientific HPC and synchronous distributed computing for big data, such as MapReduce and Pregel. This paper sets out detailed MPI/TCP fault-tolerant mechanisms, including primitives and functions. These elements enable the systematic and hierarchical development of a globally optimized HPC on the cloud platform.

Keywords

cloud computing; parallel processing; system recovery; virtualisation; HPC platforms; MPI error; MPI/TCP fault-tolerant mechanisms; TCP connections; TCP environment; asynchronous distributed computing; cloud computing; cloud platform; computer cluster; data intensive computing; distributed scientific computing software; failure recovery solution; high performance computing; multiple parallel programming models; performance tuning mechanisms; reliability virtual clusters; software development; sophisticated hardware; transplanting high performance data-intensive algorithms; virtual context; virtualized resources; Cloud computing; Computational modeling; Fault tolerance; Fault tolerant systems; Hardware; Virtual machining; Cloud Computing; Computing; Data-Intensive; Fault Tolerance; High-Performance Computing;

fLanguage

English

Publisher

IEEE

Conference_Titel

High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (HPCC_EUC), 2013 IEEE 10th International Conference on

Conference_Location

Zhangjiajie

Type

conf

DOI

10.1109/HPCC.and.EUC.2013.207

Filename

6832089

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=688316