DocumentCode :
2176665
Title :
Reducing recovery time in a small recursively restartable system
Author :
Candea, George ; Cutler, James ; Fox, Armando ; Doshi, Rushabh ; Garg, Priyank ; Gowda, Rakesh
Author_Institution :
Stanford Univ., CA, USA
fYear :
2002
fDate :
2002
Firstpage :
605
Lastpage :
614
Abstract :
We present ideas on how to structure software systems for high availability by considering MTTR/MTTF characteristics of components in addition to the traditional criteria, such as functionality or state sharing. Recursive restartability (RR), a recently proposed technique for achieving high availability, exploits partial restarts at various levels within complex software infrastructures to recover from transient failures and rejuvenate software components. Here we refine the original proposal and apply the RR philosophy to Mercury, a COTS-based satellite ground station that has been in operation for over 2 years. We develop three techniques for transforming component group boundaries such that time-to-recover is reduced, hence increasing system availability. We also further RR by defining the notions of an oracle, restart group and restart policy, while showing how to reason about system properties in terms of restart groups. From our experience with applying RR to Mercury, we draw design guidelines and lessons for the systematic application of recursive restartability to other software systems amenable to RR.
Keywords :
aerospace computing; ground support systems; software reliability; system recovery; COTS-based satellite ground station; MTTR/MTTF characteristics; Mercury; complex software infrastructures; component group boundaries; functionality; high availability; oracle; partial restarts; recovery time reduction; restart group; restart policy; small recursively restartable system; software component rejuvenation; software systems; state sharing; transient failure recovery; Application software; Availability; Collaborative software; Communication system software; Guidelines; Proposals; Satellite antennas; Satellite broadcasting; Satellite ground stations; Software systems;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Dependable Systems and Networks, 2002. DSN 2002. Proceedings. International Conference on
Print_ISBN :
0-7695-1101-5
Type :
conf
DOI :
10.1109/DSN.2002.1029006
Filename :
1029006