DocumentCode
1926051
Title
Increasing the availability provided by RADIC with low overhead
Author
Santos, Guna ; Fialho, Leonardo ; Rexachs, Dolores ; Luque, Emilio
Author_Institution
Comput. Archit. & Oper. Syst. Dept., Univ. Autonoma de Barcelona, Barcelona, Spain
fYear
2009
fDate
Aug. 31 2009-Sept. 4 2009
Firstpage
1
Lastpage
8
Abstract
For machines composed of a large number of processing units, fault probability tends to increase linearly with the number of units. This makes fault tolerance a major concern. A fault-tolerant solution provides a certain level of availability, which is usually constrained by time overhead, performance degradation, resources, or cost. In rollback-recovery protocols, availability is usually increased by raising the checkpoint frequency or by making several replicas of checkpoints and/or logs. Such replication allows the solution to tolerate concurrent correlated faults, i.e., a fault in a computing node and in the stable storage. These faults are theoretically less probable; however, recent studies have shown that faults are temporally and spatially correlated, which increases the probability of concurrent faults. The major concern in replicating checkpoints and logs is the overhead of storing the replicas across several repositories, which may make replication impractical. In this paper we present how we increased the availability provided by RADIC without significantly increasing its overhead. Our approach consists of parallelizing the storage of these replicas using the pipeline technique. This technique allows us to make low-overhead copies of checkpoints and logs over N protectors. Furthermore, as a secondary benefit, the pipelining between observer and protector reduces the pessimistic message logging overhead by more than a factor of four in the best case.
Keywords
checkpointing; fault tolerant computing; message passing; pipeline processing; protocols; RADIC; checkpoint frequency; concurrent correlated fault tolerance; fault probability; performance degradation; pipeline technique; rollback-recovery protocol; system log; time overhead; Availability; Computer architecture; Concurrent computing; Costs; Fault tolerance; Hardware; Pipeline processing; Protection; Protocols; Redundancy
fLanguage
English
Publisher
IEEE
Conference_Titel
Cluster Computing and Workshops, 2009. CLUSTER '09. IEEE International Conference on
Conference_Location
New Orleans, LA
ISSN
1552-5244
Print_ISBN
978-1-4244-5011-4
Electronic_ISBN
1552-5244
Type
conf
DOI
10.1109/CLUSTR.2009.5289163
Filename
5289163