• DocumentCode
    2027455
  • Title

    A practical approach for 'zero' downtime in an operational information system

  • Author

    Gavrilovska, Ada ; Schwan, Karsten ; Oleson, Van

  • Author_Institution
    Coll. of Comput., Georgia Inst. of Technol., Atlanta, GA, USA
  • fYear
    2002
  • fDate
    2002
  • Firstpage
    345
  • Lastpage
    352
  • Abstract
    An operational information system (OIS) supports a real-time view of an organization's information critical to its logistical business operations. A central component of an OIS is an engine that integrates data events captured from distributed, remote sources in order to derive meaningful real-time views of current operations. This event derivation engine (EDE) continuously updates these views and also publishes them to a potentially large number of remote subscribers. The paper first describes a sample OIS and EDE in the context of an airline's operations. It then defines the performance and availability requirements to be met by this system, specifically focusing on the EDE component. One particular requirement for the EDE is that subscribers to its output events should not experience downtime due to EDE failures, crashes or increased processing loads. Toward this end, we develop and evaluate a practical technique for masking failures and for hiding the costs of recovery from EDE subscribers. This technique utilizes redundant EDEs that coordinate view replicas with a relaxed synchronous fault tolerance protocol. A combination of pre- and post-buffering of replicas is used to attain a solution that offers low response times (i.e., 'zero' downtime) while also preventing system failures in the presence of deterministic faults like 'ill-formed' messages. Parallelism realized via a cluster machine and application-specific techniques for reducing synchronization across replicas are used to scale a 'zero' downtime EDE to support the large number of subscribers it must service.
  • Keywords
    information systems; software fault tolerance; system recovery; travel industry; workstation clusters; Delta Air Lines; airline operations; application-specific techniques; availability requirements; cluster machine; continuous view updates; crashes; data event integration; deterministic faults; distributed remote sources; event derivation engine; failures; ill-formed messages; increased processing loads; logistical business operations; operational information system; parallelism; performance requirements; post-buffering; pre-buffering; real-time view; redundant event derivation engine; relaxed synchronous fault tolerance protocol; subscribers; synchronization reduction; view replicas; zero downtime; Airports; Availability; Computer architecture; Computer crashes; Delay; Engines; Hardware; Information systems; Parallel processing; Real time systems
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Distributed Computing Systems, 2002. Proceedings. 22nd International Conference on
  • ISSN
    1063-6927
  • Print_ISBN
    0-7695-1585-1
  • Type

    conf

  • DOI
    10.1109/ICDCS.2002.1022272
  • Filename
    1022272