Title :
The performance of consistent checkpointing in distributed shared memory systems
Author :
Cabillic, Gilbert ; Muller, Gilles ; Puaut, Isabelle
Author_Institution :
IRISA, Rennes, France
Abstract :
This paper presents the design and implementation of a consistent checkpointing scheme for distributed shared memory (DSM) systems. Our approach integrates checkpoints into the synchronization barriers that already exist in applications, which avoids the need to introduce an additional synchronization mechanism. The main advantage of our checkpointing mechanism is that performance degradation arises only when a checkpoint is being taken; hence, the programmer can adjust the trade-off between the cost of checkpointing and the cost of longer rollbacks by adjusting the time between two successive checkpoints. The paper compares several implementations of the proposed consistent checkpointing mechanism (incremental, non-blocking, and pre-flushing) on the Intel Paragon multicomputer for several parallel scientific applications. Performance measurements show that careful optimization of the checkpointing protocol can reduce the time overhead of checkpointing from 8% to 0.04% of the application duration for a 6-minute checkpointing interval.
Keywords :
distributed memory systems; message passing; program debugging; shared memory systems; software performance evaluation; synchronisation; Intel Paragon multicomputer; consistent checkpointing; distributed shared memory systems; parallel scientific applications; performance; performance degradation; rollbacks; synchronization barriers; Checkpointing; Computer crashes; Costs; Degradation; Frequency synchronization; Hardware; Message passing; Protocols; Random access memory; Time measurement
Conference_Title :
Reliable Distributed Systems, 1995. Proceedings., 14th Symposium on
Conference_Location :
Bad Neuenahr
Print_ISBN :
0-8186-7153-X
DOI :
10.1109/RELDIS.1995.526217