DocumentCode :
2570122
Title :
Evaluation of checkpoint mechanisms for massively parallel machines
Author :
Chiueh, Tzi-cker ; Deng, Peitao
Author_Institution :
Dept. of Comput. Sci., State Univ. of New York, Stony Brook, NY, USA
fYear :
1996
fDate :
25-27 Jun 1996
Firstpage :
370
Lastpage :
379
Abstract :
Massively parallel machines typically contain thousands of processor units and are therefore more likely to suffer system breakdown because of component failures. This paper studies efficient diskless checkpointing mechanisms for SIMD massively parallel machines. Three checkpointing schemes (mirror checkpointing, parity checkpointing, and partial parity checkpointing) are compared in terms of their checkpoint performance and storage overheads, based on empirical measurements. The mirror checkpointing and parity checkpointing schemes have been successfully implemented and tested on a DECmpp 12000 machine, without hardware or OS modifications. It has been shown that mirror checkpointing is an order of magnitude faster than parity checkpointing, but incurs twice the storage overhead. Partial parity checkpointing, although it significantly reduces the storage overhead, could lead to unpredictable execution performance. This paper also examines the detailed storage/performance tradeoffs for partial parity checkpointing through manual instrumentation, and describes the implementation experience from these experiments.
Keywords :
DEC computers; fault tolerant computing; parallel algorithms; parallel machines; performance evaluation; system recovery; DECmpp 12000 machine; SIMD; checkpoint performance; checkpointing schemes; component failure; diskless checkpointing; massively parallel machines; mirror checkpointing; parity checkpointing; partial parity checkpointing; storage overhead; storage performance tradeoffs; system breakdown; unpredictable execution performance; Batteries; Checkpointing; Computer science; Concurrent computing; Electric breakdown; Hardware; Instruments; Mirrors; Parallel machines; Testing;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Fault Tolerant Computing, 1996., Proceedings of Annual Symposium on
Conference_Location :
Sendai
ISSN :
0731-3071
Print_ISBN :
0-8186-7262-5
Type :
conf
DOI :
10.1109/FTCS.1996.534622
Filename :
534622