Title :
Portable transparent checkpointing for distributed shared memory
Author :
Silva, Luis M. ; Silva, João Gabriel ; Chapple, Simon
Author_Institution :
Dept. de Engenharia Inf., Coimbra Univ., Portugal
Abstract :
We present a checkpointing mechanism for a DSM system that, in spite of being invisible to the programmer, is quite efficient and portable. It is efficient because it is nonblocking and coordinated, and thus free of the domino effect. It offers some portability because it is built on top of MPI and uses only the services offered by MPI and a POSIX-compliant local file system. As far as we know, this is the first real implementation of such a scheme for DSM. Along with the description of the algorithms used, we present experimental results obtained on a cluster of workstations and discuss the insights that came out of the implementation effort. We hope that our research shows that efficient, transparent and portable checkpointing is viable for DSM systems.
Keywords :
Unix; distributed memory systems; message passing; parallel algorithms; shared memory systems; software portability; system recovery; MPI; Message Passing Interface; POSIX compliant local file system; distributed shared memory systems; domino-effect free; nonblocking mechanism; portable transparent checkpointing; workstation cluster; Checkpointing; Clustering algorithms; Computer crashes; Distributed computing; Fault tolerant systems; File systems; Parallel machines; Programming profession; Scalability; Workstations;
Conference_Title :
High Performance Distributed Computing, 1996., Proceedings of 5th IEEE International Symposium on
Conference_Location :
Syracuse, NY, USA
Print_ISBN :
0-8186-7582-9
DOI :
10.1109/HPDC.1996.546213