• DocumentCode
    1926051
  • Title
    Increasing the availability provided by RADIC with low overhead
  • Author
    Santos, Guna ; Fialho, Leonardo ; Rexachs, Dolores ; Luque, Emilio
  • Author_Institution
    Comput. Archit. & Oper. Syst. Dept., Univ. Autonoma de Barcelona, Barcelona, Spain
  • fYear
    2009
  • fDate
    Aug. 31 2009-Sept. 4 2009
  • Firstpage
    1
  • Lastpage
    8
  • Abstract
    For machines composed of a large number of processing units, fault probability tends to increase linearly with this number. This makes the use of a fault-tolerant solution a major issue. A fault-tolerant solution provides a certain level of availability, which is usually influenced by time overhead, performance degradation, resources, or cost. In rollback-recovery protocols, availability is usually increased by raising the checkpoint frequency or by making several replicas of checkpoints and/or logs. Such replication allows the solution to tolerate concurrent correlated faults, i.e., a fault in a computing node and in the stable storage. These faults are theoretically less probable; however, recent studies have shown that faults are temporally and spatially correlated, which increases the probability of concurrent faults. The major concern when replicating checkpoints and logs is the overhead caused by storing these replicas over various repositories, which may preclude their use. In this paper we present how we increased the availability provided by RADIC without significantly increasing its overhead. Our approach consists of parallelizing the storing of these replicas using the pipeline technique. Such a technique allows us to make low-overhead copies of checkpoints and logs over N protectors. Furthermore, as a secondary benefit, the pipelining between observer and protector reduces the pessimistic message logging overhead by more than a factor of four (in the best case).
  • Keywords
    checkpointing; fault tolerant computing; message passing; pipeline processing; protocols; RADIC; checkpoint frequency; concurrent correlated fault tolerance; fault probability; performance degradation; pipeline technique; rollback-recovery protocol; system log; time overhead; Availability; Computer architecture; Concurrent computing; Costs; Fault tolerance; Hardware; Protection; Redundancy
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cluster Computing and Workshops, 2009. CLUSTER '09. IEEE International Conference on
  • Conference_Location
    New Orleans, LA
  • ISSN
    1552-5244
  • Print_ISBN
    978-1-4244-5011-4
  • Electronic_ISBN
    1552-5244
  • Type
    conf
  • DOI
    10.1109/CLUSTR.2009.5289163
  • Filename
    5289163