Title :
NCAPS: application high availability in Unix computer clusters
Author :
Laranjeira, Luiz A.
Author_Institution :
Commun. Products Group, Tandem Comput. Inc., Austin, TX, USA
Abstract :
The paper presents a solution for improving the availability of applications running on a Unix computer cluster with two or more nodes. Tandem's NCAPS (NonStop Clusters Application Protection System) consists of specialized system software that is capable of recovering applications after hardware, software or operating system failures. The main component of NCAPS, the PPM (Process Pairs Manager), uses a primary and warm backup approach to achieve recovery times in the range of 10 seconds (for nodes having access to all needed resources), regardless of the application initialization time. This is a clear improvement over the recovery times provided by existing high availability (HA) solutions, which are typically on the order of 1 minute plus the application reinitialization time. The PPM manages an application through a configurable, user-specified state model in which state changes are triggered by detected failures or system administrator commands. Upon a state transition, the PPM sends a state change command message to registered application processes. Communication between the application processes and the PPM is achieved through a set of API (application programming interface) calls provided by the OftLib (Open Fault Tolerance Library), also called FT-API. NCAPS is now available on Unix clusters composed of Tandem S4000 machines. A version that runs on NSC (NonStop Clusters), Tandem's SSI (Single System Image) product, for a cluster of Compaq ProLiant machines is under development.
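The state-model mechanism described in the abstract can be illustrated with a small sketch. It is not the PPM or the OftLib/FT-API interface, which the abstract does not specify; the class `PairManager`, its methods, and the state and event names below are all hypothetical. It only shows the general idea of a configurable transition table in which failures or administrator commands trigger a state change, and each change is announced to registered processes.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical state and event names; the abstract does not list the real ones.
Transition = Tuple[str, str]              # (current state, event)
Listener = Callable[[str, str], None]     # called with (old state, new state)


class PairManager:
    """Illustrative stand-in for a manager driving a user-specified state model."""

    def __init__(self, transitions: Dict[Transition, str], initial: str) -> None:
        self.transitions = dict(transitions)  # configurable transition table
        self.state = initial
        self.listeners: List[Listener] = []

    def register(self, listener: Listener) -> None:
        # An application process registers to receive state change commands.
        self.listeners.append(listener)

    def handle(self, event: str) -> str:
        # An event is a detected failure or an administrator command.
        new_state = self.transitions.get((self.state, event))
        if new_state is None:
            return self.state  # event has no effect in the current state
        old_state, self.state = self.state, new_state
        for listener in self.listeners:
            listener(old_state, new_state)  # notify every registered process
        return self.state


# Example configuration (assumed): the warm backup takes over on failure.
EXAMPLE_MODEL: Dict[Transition, str] = {
    ("PRIMARY_ACTIVE", "primary_failed"): "BACKUP_ACTIVE",
    ("PRIMARY_ACTIVE", "admin_switchover"): "BACKUP_ACTIVE",
    ("BACKUP_ACTIVE", "admin_switchover"): "PRIMARY_ACTIVE",
}
```

In this sketch, calling `handle("primary_failed")` on a manager that starts in `PRIMARY_ACTIVE` moves it to `BACKUP_ACTIVE` and invokes each registered listener once; events with no entry in the table leave the state unchanged.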
Keywords :
Unix; application program interfaces; software fault tolerance; system recovery; API calls; NonStop Clusters Application Protection System; OftLib; PPM; Process Pairs Manager; Tandem NCAPS; Tandem S4000 machines; Unix computer clusters; application high availability; application recovery; application reinitialization time; communication; configurable user-specified state model; detected failures; hardware failure; operating system failure; primary backup approach; registered application processes; software failure; specialized system software; state change command message; state transition; system administrator commands; warm backup approach; Application software; Availability; Fault tolerance; Hardware; Libraries; Operating systems; Protection; Resource management; Software systems; System software;
Conference_Title :
Fault-Tolerant Computing, 1998. Digest of Papers. Twenty-Eighth Annual International Symposium on
Conference_Location :
Munich, Germany
Print_ISBN :
0-8186-8470-4
DOI :
10.1109/FTCS.1998.689496