• DocumentCode
    2016999
  • Title
    Handling Persistent States in Process Checkpoint/Restart Mechanisms for HPC Systems
  • Author
    Riteau, Pierre ; Lebre, Adrien ; Morin, Christine
  • Author_Institution
    Centre Rennes - Bretagne Atlantique, INRIA, Rennes
  • fYear
    2009
  • fDate
    18-21 May 2009
  • Firstpage
    404
  • Lastpage
    411
  • Abstract
    Computer clusters are today the reference architecture for high-performance computing. The large number of nodes in these systems induces a high failure rate. This makes fault tolerance mechanisms, e.g. process checkpoint/restart, a required technology to effectively exploit clusters. Most process checkpoint/restart implementations handle only volatile states and do not take into account the persistent states of applications, which can lead to incoherent application restarts. In this paper, we introduce an efficient persistent state checkpoint/restoration approach that can be interconnected with a large number of file systems. To avoid the performance issues of a stable support relying on synchronous replication mechanisms, we present a failure resilience scheme optimized for such persistent state checkpointing techniques in a distributed environment. First evaluations of our implementation in the kDFS distributed file system show the negligible performance impact of our proposal.
  • Keywords
    file organisation; grid computing; software fault tolerance; software maintenance; HPC systems; computer clusters; failure resilience scheme; fault tolerance mechanisms; high-performance computing; kDFS distributed file system; persistent state checkpointing technique; process checkpoint-restart mechanism; Checkpointing; Computer architecture; Contracts; Fault tolerance; Fault tolerant systems; File systems; Grid computing; Proposals; Registers; Resilience; Distributed architectures; Distributed file systems; High performance; Persistent state checkpointing; Process checkpoint/restart;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cluster Computing and the Grid, 2009. CCGRID '09. 9th IEEE/ACM International Symposium on
  • Conference_Location
    Shanghai
  • Print_ISBN
    978-1-4244-3935-5
  • Electronic_ISBN
    978-0-7695-3622-4
  • Type
    conf
  • DOI
    10.1109/CCGRID.2009.29
  • Filename
    5071898