Title :
A two-level checkpoint algorithm in a highly-available parallel single level store system
Author :
Morin, Christine ; Lottiaux, Renaud ; Kermarrec, Anne-Marie
Author_Institution :
Rennes I Univ., France
Abstract :
A parallel single level store system (PSLS) integrates a shared virtual memory and a parallel file system. By managing data globally, it provides programmers of scientific applications with the attractive shared memory programming model combined with a large and efficient file system in a cluster. We present a cheap and efficient two-level checkpointing approach that enables a PSLS to tolerate failures. The first-level checkpointing algorithm is very efficient and saves data in memory, but it requires a large amount of memory space. When memories are saturated, an alternative algorithm that saves the checkpoint on disk is used instead. Performance results show the impact of different variants of the checkpointing algorithms.
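The two-level policy described in the abstract (checkpoint to memory while space allows, fall back to disk once memory is saturated) can be illustrated with a minimal single-process sketch. This is not the authors' algorithm: a real PSLS would keep recovery data in the memories of other cluster nodes and coordinate the checkpoint across them. The class name `TwoLevelCheckpointer`, the `memory_budget_bytes` saturation threshold, and the `.ckpt` file naming are all hypothetical and chosen only for illustration.

```python
import os
import pickle


class TwoLevelCheckpointer:
    """Toy two-level checkpointer: memory first, disk when memory is full."""

    def __init__(self, memory_budget_bytes, disk_dir):
        # Hypothetical saturation threshold for the in-memory level.
        self.memory_budget_bytes = memory_budget_bytes
        self.disk_dir = disk_dir
        self.in_memory = {}  # checkpoint name -> serialized bytes
        self.on_disk = {}    # checkpoint name -> file path

    def _memory_used(self, excluding=None):
        # Bytes held by in-memory checkpoints, ignoring the one being replaced.
        return sum(len(blob) for name, blob in self.in_memory.items()
                   if name != excluding)

    def checkpoint(self, name, state):
        """Save `state`; return which level ("memory" or "disk") holds it."""
        blob = pickle.dumps(state)
        fits = self._memory_used(excluding=name) + len(blob) <= self.memory_budget_bytes
        if fits:
            # Level 1: fast in-memory copy.
            self.in_memory[name] = blob
            stale_path = self.on_disk.pop(name, None)
            if stale_path is not None and os.path.exists(stale_path):
                os.remove(stale_path)
            return "memory"
        # Level 2: memory is saturated, so write the checkpoint to disk.
        path = os.path.join(self.disk_dir, name + ".ckpt")
        with open(path, "wb") as f:
            f.write(blob)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to stable storage
        self.on_disk[name] = path
        self.in_memory.pop(name, None)  # drop the stale in-memory copy, if any
        return "disk"

    def restore(self, name):
        """Return the most recently saved state for `name`."""
        if name in self.in_memory:
            return pickle.loads(self.in_memory[name])
        with open(self.on_disk[name], "rb") as f:
            return pickle.loads(f.read())
```

In this sketch a caller would invoke `checkpoint(name, state)` at each recovery point and `restore(name)` after a failure. The trade-off the abstract points at shows up directly: the memory level avoids file I/O but consumes the budget, while the disk level is slower but is not bounded by `memory_budget_bytes`.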
Keywords :
parallel programming; shared memory systems; software fault tolerance; software performance evaluation; system recovery; virtual storage; fault tolerance; file system; global data management; parallel file system; parallel single level store system; performance results; scientific applications; shared memory programming model; shared virtual memory; two-level checkpoint algorithm; Checkpointing; Clustering algorithms; File systems; Memory management; Numerical simulation; Parallel programming; Programming profession; Support vector machines; Tiles; Workstations;
Conference_Title :
Cluster Computing and the Grid, 2001. Proceedings. First IEEE/ACM International Symposium on
Conference_Location :
Brisbane, Qld.
Print_ISBN :
0-7695-1010-8
DOI :
10.1109/CCGRID.2001.923236