DocumentCode :
1920416
Title :
On the Viability of Compression for Reducing the Overheads of Checkpoint/Restart-Based Fault Tolerance
Author :
Ibtesham, Dewan ; Arnold, Dorian ; Bridges, Patrick G. ; Ferreira, Kurt B. ; Brightwell, Ron
Author_Institution :
Dept. of Comput. Sci., Univ. of New Mexico, Albuquerque, NM, USA
fYear :
2012
fDate :
10-13 Sept. 2012
Firstpage :
148
Lastpage :
157
Abstract :
The increasing size and complexity of high performance computing (HPC) systems have led to major concerns over fault frequencies and the mechanisms necessary to tolerate these faults. Previous studies have shown that state-of-the-field checkpoint/restart mechanisms will not scale sufficiently for future generation systems. Therefore, optimizations that reduce checkpoint overheads are necessary to keep checkpoint/restart mechanisms effective. In this work, we demonstrate that checkpoint data compression is a feasible mechanism for reducing checkpoint commit latencies and storage overheads. Leveraging a simple model for checkpoint compression viability, we show: (1) checkpoint data compression is feasible for many types of scientific applications expected to run on extreme scale systems, (2) checkpoint compression viability scales with checkpoint size, (3) the choice between user-level and system-level checkpoints has little impact on checkpoint compression viability, and (4) checkpoint compression viability scales with application process count. Lastly, we describe the impact that checkpoint compression might have on future generation extreme scale systems.
Keywords :
checkpointing; data compression; software fault tolerance; HPC system; checkpoint commit latency; checkpoint compression viability scale; checkpoint data compression; checkpoint overhead; checkpoint size; checkpoint/restart mechanism; checkpoint/restart-based fault tolerance; fault frequency; high performance computing; scientific application; storage overhead; system-level checkpoints; user-level checkpoints; Benchmark testing; Checkpointing; Compression algorithms; Data compression; Fault tolerance; Libraries; Mathematical model; Checkpoint Compression; Fault tolerance;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Parallel Processing (ICPP), 2012 41st International Conference on
Conference_Location :
Pittsburgh, PA
ISSN :
0190-3918
Print_ISBN :
978-1-4673-2508-0
Type :
conf
DOI :
10.1109/ICPP.2012.45
Filename :
6337576