DocumentCode :
2933482
Title :
A scalable double in-memory checkpoint and restart scheme towards exascale
Author :
Zheng, Gengbin ; Ni, Xiang ; Kalé, Laxmikant V.
Author_Institution :
Dept. of Comput. Sci., Univ. of Illinois at Urbana-Champaign, Urbana, IL, USA
fYear :
2012
fDate :
25-28 June 2012
Firstpage :
1
Lastpage :
6
Abstract :
As the size of supercomputers increases, the probability of system failure grows substantially, posing an increasingly significant challenge for scalability. It is important to provide resilience for long-running applications. Checkpoint-based fault tolerance methods are an effective approach to dealing with faults. With these methods, the state of the entire parallel application is checkpointed to reliable storage. When a failure occurs, the application is restarted from a recent checkpoint. In previous work, we have demonstrated an efficient double in-memory checkpoint and restart fault tolerance scheme, which leverages Charm++'s parallel objects for checkpointing. In this paper, we further optimize the scheme by eliminating several bottlenecks caused by serialized communication. We extend the in-memory checkpointing scheme to work on the MPI communication layer, and demonstrate its performance on very large scale supercomputers. For example, when running a one-million-atom molecular dynamics simulation on up to 64K cores of a BlueGene/P machine, the checkpoint time was in milliseconds. The restart time was measured to be less than 0.15 seconds on 64K cores.
Keywords :
application program interfaces; checkpointing; fault tolerant computing; mainframes; message passing; parallel processing; MPI communication layer; checkpoint-based fault tolerance methods; double in-memory checkpointing scheme; exascale; parallel application; restart scheme; very large scale supercomputers; Checkpointing; Computer crashes; Fault tolerance; Fault tolerant systems; Optimization; Program processors; Protocols;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Dependable Systems and Networks Workshops (DSN-W), 2012 IEEE/IFIP 42nd International Conference on
Conference_Location :
Boston, MA
Print_ISBN :
978-1-4673-2264-5
Electronic_ISBN :
978-1-4673-2265-2
Type :
conf
DOI :
10.1109/DSNW.2012.6264677
Filename :
6264677