Increasing the availability provided by RADIC with low overhead

Author

Santos, Guna ; Fialho, Leonardo ; Rexachs, Dolores ; Luque, Emilio

Author_Institution

Comput. Archit. & Oper. Syst. Dept., Univ. Autonoma de Barcelona, Barcelona, Spain

fYear

2009

fDate

Aug. 31 2009-Sept. 4 2009

Firstpage

1

Lastpage

8

Abstract

For machines composed of a large number of processing units, fault probability tends to increase linearly with the number of units. This makes the use of a fault-tolerant solution a major issue. A fault-tolerant solution provides a certain level of availability, which is usually traded off against time overhead, performance degradation, resources, or cost. In rollback-recovery protocols, availability is usually increased by raising the checkpoint frequency or by making several replicas of checkpoints and/or logs. Such replication allows the solution to tolerate concurrent correlated faults, i.e., a fault in a computing node and in the stable storage. These faults are theoretically less probable; however, recent studies have shown that faults are temporally and spatially correlated, which increases the probability of concurrent faults. The major concern in replicating checkpoints and logs is the overhead caused by storing the replicas in several repositories, which may make replication impractical. In this paper we present how we increased the availability provided by RADIC without significantly increasing its overhead. Our approach consists of parallelizing the storage of these replicas using the pipeline technique. This technique allows us to make low-overhead copies of checkpoints and logs over N protectors. Furthermore, as a secondary benefit, the pipelining between observer and protector reduces the pessimistic message logging overhead by more than a factor of four in the best case.

Keywords

checkpointing; fault tolerant computing; message passing; pipeline processing; protocols; RADIC; checkpoint frequency; concurrent correlated fault tolerance; fault probability; message passing; performance degradation; pipeline technique; rollback-recovery protocol; system log; time overhead; Availability; Computer architecture; Concurrent computing; Costs; Fault tolerance; Hardware; Pipeline processing; Protection; Protocols; Redundancy

fLanguage

English

Publisher

IEEE

Conference_Titel

Cluster Computing and Workshops, 2009. CLUSTER '09. IEEE International Conference on

Conference_Location

New Orleans, LA

ISSN

1552-5244

Print_ISBN

978-1-4244-5011-4

Electronic_ISBN

1552-5244

Type

conf

DOI

10.1109/CLUSTR.2009.5289163

Filename

5289163