A practical approach for 'zero' downtime in an operational information system

Author

Gavrilovska, Ada ; Schwan, Karsten ; Oleson, Van

Author_Institution

Coll. of Comput., Georgia Inst. of Technol., Atlanta, GA, USA

fYear

2002

fDate

2002

Firstpage

345

Lastpage

352

Abstract

An operational information system (OIS) supports a real-time view of an organization's information critical to its logistical business operations. A central component of an OIS is an engine that integrates data events captured from distributed, remote sources in order to derive meaningful real-time views of current operations. This event derivation engine (EDE) continuously updates these views and also publishes them to a potentially large number of remote subscribers. The paper first describes a sample OIS and EDE in the context of an airline's operations. It then defines the performance and availability requirements to be met by this system, specifically focusing on the EDE component. One particular requirement for the EDE is that subscribers to its output events should not experience downtime due to EDE failures, crashes or increased processing loads. Toward this end, we develop and evaluate a practical technique for masking failures and for hiding the costs of recovery from EDE subscribers. This technique utilizes redundant EDEs that coordinate view replicas with a relaxed synchronous fault tolerance protocol. A combination of pre- and post-buffering of replicas is used to attain a solution that offers low response times (i.e., 'zero' downtime) while also preventing system failures in the presence of deterministic faults like 'ill-formed' messages. Parallelism realized via a cluster machine and application-specific techniques for reducing synchronization across replicas are used to scale a 'zero' downtime EDE to support the large number of subscribers it must service.

Keywords

information systems; software fault tolerance; system recovery; travel industry; workstation clusters; Delta Air Lines; airline operations; application-specific techniques; availability requirements; cluster machine; continuous view updates; crashes; data event integration; deterministic faults; distributed remote sources; event derivation engine; failures; ill-formed messages; increased processing loads; logistical business operations; operational information system; parallelism; performance requirements; post-buffering; pre-buffering; real-time view; redundant event derivation engine; relaxed synchronous fault tolerance protocol; subscribers; synchronization reduction; view replicas; zero downtime; Airports; Availability; Computer architecture; Computer crashes; Delay; Engines; Hardware; Information systems; Parallel processing; Real time systems

fLanguage

English

Publisher

IEEE

Conference_Titel

Distributed Computing Systems, 2002. Proceedings. 22nd International Conference on

ISSN

1063-6927

Print_ISBN

0-7695-1585-1

Type

conf

DOI

10.1109/ICDCS.2002.1022272

Filename

1022272