Title :
Fault-tolerance in a distributed management system: a case study
Author :
Smeikal, Robert ; Goeschka, Karl M.
Author_Institution :
Vienna Univ. of Technol., Austria
Abstract :
Our case study provides the most important conceptual lessons learned from the implementation of a Distributed Telecommunication Management System (DTMS), which controls a networked voice communication system. Major requirements for the DTMS are fault-tolerance against site or network failures, transactional safety, and reliable persistence. In order to provide distribution and persistence in a transparent and fault-tolerant manner, we introduce a two-layer architecture facilitating an asynchronous replication algorithm. Among the lessons learned are: component-based software engineering poses a significant initial overhead but is worth it in the long term; a fault-tolerant naming service is a key requirement for fail-safe distribution; the reasonable granularity for persistence and concurrency control is one whole object; asynchronous replication on the database layer is superior to synchronous replication on the instance level in terms of robustness and consistency; semi-structured persistence with XML has drawbacks regarding consistency, performance and convenience; in contrast to an arbitrarily meshed object model, an accentuated hierarchical structure is more robust and feasible; a query engine has to provide a means for navigation through the object model; finally, the propagation of deletion operations becomes more complex in an object-oriented model. By incorporating these lessons learned, we are well underway to provide a highly available, distributed platform for persistent object systems.
Keywords :
distributed databases; distributed object management; object-oriented programming; software fault tolerance; voice communication; XML; asynchronous replication algorithm; component based software engineering; distributed telecommunication management system; fault-tolerance; naming service; networked voice communication system; object oriented programming; two-layer architecture; Communication system control; Control systems; Fault tolerance; Fault tolerant systems; Object oriented modeling; Robust control; Safety; Telecommunication control; Telecommunication network management; Telecommunication network reliability;
Conference_Title :
Software Engineering, 2003. Proceedings. 25th International Conference on
Print_ISBN :
0-7695-1877-X
DOI :
10.1109/ICSE.2003.1201225