• DocumentCode
    3501796
  • Title

    Increasing the cluster availability using RADIC

  • Author

    Duarte, Angelo ; Rexachs, Dolores ; Luque, Emilio

  • Author_Institution
    Dept. of Comput. Archit. & Oper. Syst., Univ. Autonoma de Barcelona
  • fYear
    2006
  • fDate
    25-28 Sept. 2006
  • Firstpage
    1
  • Lastpage
    8
  • Abstract
    The redundant array of distributed independent checkpoints (RADIC) is a fault-tolerant architecture based on a fully distributed array of dedicated processes. These processes collaborate to form a fault tolerance controller that transparently manages all fault tolerance activities. The architecture is designed as a software layer between the application and the cluster structure, and it was developed to meet the requirements of scalability, user transparency, and independence from dedicated or stable cluster resources. RADIC requires only the resources already available in the nodes used by the parallel application, and it uses a pessimistic message-log rollback-recovery protocol so that it can operate without any global synchronization. This protocol, together with the independence from central elements, makes RADIC a scalable architecture that works transparently to the user. We tested the functionality and performance of the architecture in a real scenario using a prototype based on the MPI standard (RADICMPI).
  • Keywords
    checkpointing; fault tolerance; fault tolerant computing; message passing; MPI standard; RADIC; cluster availability; distributed array; distributed independent checkpoints; fault tolerance controller; fault tolerant architecture; message-log rollback-recovery protocol; parallel application; redundant array; scalable architecture; software layer; Application software; Collaboration; Computer architecture; Costs; Fault tolerance; Fault tolerant systems; Operating systems; Programming profession; Protocols; Scalability;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cluster Computing, 2006 IEEE International Conference on
  • Conference_Location
    Barcelona
  • ISSN
    1552-5244
  • Print_ISBN
    1-4244-0327-8
  • Electronic_ISBN
    1552-5244
  • Type

    conf

  • DOI
    10.1109/CLUSTR.2006.311872
  • Filename
    4100378