Title :
Nonblocking checkpointing for optimistic parallel simulation: description and an implementation
Author :
Quaglia, Francesco ; Santoro, Andrea
Author_Institution :
Dipartimento di Informatica & Sistemistica, Univ. di Roma "La Sapienza", Italy
fDate :
6/1/2003
Abstract :
This paper describes a nonblocking checkpointing mode in support of optimistic parallel discrete event simulation. This mode allows real concurrency between state saving and other simulation-specific operations (e.g., event list update, event execution), with the aim of removing the cost of recording state information from the completion time of the parallel simulation application. We present an implementation of a C library supporting nonblocking checkpointing on a Myrinet-based cluster, which demonstrates the practical viability of this checkpointing mode on standard off-the-shelf hardware. Through an empirical study on classical parameterized synthetic benchmarks, we show that, except for applications with minimal state granularity, nonblocking checkpointing improves the speed of the parallel execution compared to commonly adopted, optimized checkpointing methods based on the classical blocking mode. A performance study of a personal communication system (PCS) simulation is additionally reported to point out the benefits of nonblocking checkpointing for a real-world application.
Keywords :
concurrency control; data integrity; data structures; discrete event simulation; message passing; system recovery; workstation clusters; C library; DMA; completion time; concurrency; discrete event simulation; minimal state granularity; Myrinet-based cluster; nonblocking checkpointing; optimistic parallel simulation; optimistic synchronization; performance optimization; personal communication system; state saving; Central Processing Unit; Checkpointing; Circuit simulation; Concurrent computing; Context modeling; Costs; Discrete event simulation; Hardware; Libraries; Personal communication networks;
Journal_Title :
Parallel and Distributed Systems, IEEE Transactions on
DOI :
10.1109/TPDS.2003.1206506