Regional Information Center for Science and Technology - NCAPS: application high availability in Unix computer clusters

DocumentCode :

3176853

Title :

NCAPS: application high availability in Unix computer clusters

Author :

Laranjeira, Luiz A.

Author_Institution :

Commun. Products Group, Tandem Comput. Inc., Austin, TX, USA

fYear :

1998

fDate :

23-25 June 1998

Firstpage :

441

Lastpage :

450

Abstract :

The paper presents a solution for improving the availability of applications running on a Unix computer cluster with two or more nodes. Tandem's NCAPS (NonStop Clusters Application Protection System) consists of specialized system software that is capable of recovering applications after hardware, software or operating system failures. The main component of NCAPS, the PPM (Process Pairs Manager), uses a primary and warm backup approach to achieve recovery times in the range of 10 seconds (for nodes having access to all needed resources) regardless of the application initialization time. This is a clear improvement over recovery times provided by existing high availability (HA) solutions, which are typically on the order of 1 minute plus the application reinitialization time. The PPM manages an application through a configurable user-specified state model in which state changes are triggered by detected failures or system administrator commands. Upon a state transition the PPM sends a state change command message to registered application processes. Communication between the application processes and the PPM is achieved through a set of API (application programming interface) calls provided by the OftLib (Open Fault Tolerance Library), also called FT-API. NCAPS is now available on Unix clusters composed of Tandem S4000 machines. A version to run on Tandem SSI (Single System Image) product NSC (NonStop Clusters) for a cluster of Compaq ProLiant machines is under development.
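
The state-model mechanism described in the abstract can be illustrated with a minimal sketch. It is not the PPM or the OftLib/FT-API interface, neither of which is specified in this record: the class name, method names, state names, and event names below are all hypothetical and chosen only for illustration. The sketch assumes a user-supplied transition table keyed by (current state, event), where an event is either a detected failure or an administrator command, and it notifies every registered process of each state change.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical names throughout; this is not the real PPM or OftLib interface.
StateChangeHandler = Callable[[str, str], None]  # called with (old_state, new_state)


@dataclass
class ProcessPairManagerSketch:
    """Toy manager driven by a user-specified transition table."""

    # Maps (current_state, event) -> next_state; supplied by the user.
    transitions: Dict[Tuple[str, str], str]
    state: str
    handlers: List[StateChangeHandler] = field(default_factory=list)

    def register(self, handler: StateChangeHandler) -> None:
        """Register an application process to receive state change commands."""
        self.handlers.append(handler)

    def handle_event(self, event: str) -> bool:
        """Apply a failure or administrator event; return True if the state changed."""
        key = (self.state, event)
        if key not in self.transitions:
            return False  # event is not meaningful in the current state
        old, self.state = self.state, self.transitions[key]
        for handler in self.handlers:
            handler(old, self.state)  # stands in for the state change command message
        return True


# Assumed example model; state and event names are illustrative only.
MODEL = {
    ("primary_up", "primary_failure"): "backup_promoted",
    ("primary_up", "admin_switchover"): "backup_promoted",
    ("backup_promoted", "admin_restore"): "primary_up",
}

received: List[Tuple[str, str]] = []
ppm = ProcessPairManagerSketch(transitions=MODEL, state="primary_up")
ppm.register(lambda old, new: received.append((old, new)))
ppm.handle_event("primary_failure")
```

Keeping the transition table as plain data mirrors the "configurable user-specified" aspect of the model: changing the recovery policy means editing the table, not the manager.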

Keywords :

Unix; application program interfaces; software fault tolerance; system recovery; API calls; NonStop Clusters Application Protection System; OftLib; PPM; Process Pairs Manager; Tandem NCAPS; Tandem S4000 machines; Unix computer clusters; application high availability; application recovery; application reinitialization time; communication; configurable user-specified state model; detected failures; hardware failure; operating system failure; primary backup approach; registered application processes; software failure; specialized system software; state change command message; state transition; system administrator commands; warm backup approach; Application software; Availability; Fault tolerance; Hardware; Libraries; Operating systems; Protection; Resource management; Software systems; System software;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Fault-Tolerant Computing, 1998. Digest of Papers. Twenty-Eighth Annual International Symposium on

Conference_Location :

Munich, Germany

ISSN :

0731-3071

Print_ISBN :

0-8186-8470-4

Type :

conf

DOI :

10.1109/FTCS.1998.689496

Filename :

689496

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3176853