Title :
An experimental study about diskless checkpointing
Author :
Silva, Luis M.; Silva, João Gabriel
Author_Institution :
Dept. de Engenharia Inf., Coimbra Univ., Portugal
Abstract :
Checkpointing with rollback recovery is a very effective technique for tolerating failures. Checkpoint data is usually saved to disk files, but in some situations the disk operations incur considerable performance overhead. An alternative is to keep the checkpoint data in main memory. The paper presents two main-memory checkpointing schemes that can be used on any parallel machine without requiring any change to the hardware: one scheme saves the checkpoints in the memory of other processors, while the other is based on a parity approach. Both techniques have been implemented and evaluated on a commercial parallel machine. The results clearly show the superiority of one of the two schemes.
Keywords :
fault tolerant computing; parallel machines; parallel programming; storage management; system recovery; checkpoint data; commercial parallel machine; disk operation; diskless checkpointing; experimental study; memory checkpointing schemes; parity approach; performance overhead; rollback recovery; Checkpointing; Computer crashes; Fault tolerance; Hardware; Maintenance; Parallel machines; Random access memory; Read-write memory; Workstations; Writing
Conference_Title :
Euromicro Conference, 1998. Proceedings. 24th
Conference_Location :
Västerås
Print_ISBN :
0-8186-8646-4
DOI :
10.1109/EURMIC.1998.711832
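
Illustrative note on the abstract: the two main-memory schemes it mentions can be sketched as follows. This is a minimal, generic illustration and not the authors' implementation. It assumes that "saving in the memory of other processors" means mirroring each checkpoint on a ring neighbour, and that the "parity approach" is a byte-wise XOR over equally sized checkpoints. The function names (`mirror_to_neighbor`, `make_parity`, `recover`) and the sample data are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized byte strings."""
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all checkpoint blocks must have the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def mirror_to_neighbor(checkpoints):
    """Assumed neighbour scheme: node i's checkpoint is copied to node (i + 1) % n.

    Returns a dict mapping each holder node to the copy it keeps in memory.
    """
    n = len(checkpoints)
    return {(i + 1) % n: data for i, data in enumerate(checkpoints)}


def make_parity(checkpoints):
    """Assumed parity scheme: one extra block holding the XOR of all checkpoints."""
    return xor_blocks(checkpoints)


def recover(surviving, parity):
    """Rebuild the single lost checkpoint from the survivors plus the parity block."""
    return xor_blocks(list(surviving) + [parity])


# Hypothetical checkpoint contents for three nodes (equal length by construction).
checkpoints = [b"node0-st", b"node1-st", b"node2-st"]

# Neighbour scheme: node 2 holds node 1's checkpoint.
mirrors = mirror_to_neighbor(checkpoints)

# Parity scheme: lose node 1, then rebuild its checkpoint.
parity = make_parity(checkpoints)
lost = 1
surviving = [c for i, c in enumerate(checkpoints) if i != lost]
restored = recover(surviving, parity)
```

Under these assumptions, mirroring stores one full extra copy per node, whereas parity stores a single extra block for the whole group but can rebuild only one lost checkpoint at a time. The sketch says nothing about which scheme the paper found superior.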