Title :
Group-based Coordinated Checkpointing for MPI: A Case Study on InfiniBand
Author :
Gao, Qi ; Huang, Wei ; Koop, Matthew J. ; Panda, Dhabaleswar K.
Author_Institution :
Comput. Sci. & Eng., Ohio State Univ. Columbus, Columbus, OH
Abstract :
As clusters with thousands of nodes are increasingly deployed for high performance computing (HPC), fault tolerance in cluster environments has become a critical requirement. Checkpointing and rollback recovery is a common approach to achieving fault tolerance. Although widely adopted in practice, coordinated checkpointing has a known scalability limitation: severe contention for bandwidth to the storage system can occur when a large number of processes take a checkpoint at the same time, resulting in an extremely long checkpointing delay for large parallel applications. In this paper, we propose a novel group-based checkpointing design to alleviate this scalability limitation. By carefully scheduling the MPI processes to take checkpoints in smaller groups, our design reduces the number of processes simultaneously taking checkpoints, while allowing those processes not taking checkpoints to proceed with computation. We implement our design and carry out a detailed evaluation with micro-benchmarks, HPL, and the parallel version of a data mining toolkit, MotifMiner. Experimental results show that our group-based checkpointing design can significantly reduce the effective checkpointing delay, by up to 78% for HPL and up to 70% for MotifMiner.
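The grouping idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the consecutive-rank grouping policy, and the bandwidth-sharing model (every writer gets an equal share of storage bandwidth, with no coordination or inter-group communication cost) are all assumptions made for illustration, and the numbers are arbitrary.

```python
def partition_into_groups(num_procs, group_size):
    """Split ranks 0..num_procs-1 into consecutive groups of at most group_size.

    Hypothetical helper: the grouping policy here is an assumption,
    not the scheduling policy used in the paper.
    """
    if num_procs <= 0 or group_size <= 0:
        raise ValueError("num_procs and group_size must be positive")
    return [
        list(range(start, min(start + group_size, num_procs)))
        for start in range(0, num_procs, group_size)
    ]


def stall_time(procs_writing, image_mb, storage_mb_per_s):
    """Toy model: seconds a process is stalled while procs_writing processes
    write image_mb each and share storage_mb_per_s of bandwidth equally."""
    return procs_writing * image_mb / storage_mb_per_s


# Arbitrary illustrative numbers, not measurements from the paper.
groups = partition_into_groups(num_procs=8, group_size=3)

# All 8 processes checkpoint together: every process stalls for the full write.
all_at_once = stall_time(8, image_mb=100.0, storage_mb_per_s=200.0)

# Group-based: a process stalls only while its own group is writing;
# the other groups keep computing in the meantime.
grouped = max(
    stall_time(len(g), image_mb=100.0, storage_mb_per_s=200.0) for g in groups
)
```

In this simplified model the per-process stall shrinks in proportion to the group size. A real system would also have to account for coordination overhead and for processes that block while waiting on messages from a group that is currently checkpointing.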
Keywords :
application program interfaces; checkpointing; data mining; fault tolerant computing; message passing; scheduling; InfiniBand; MotifMiner; cluster environments; data mining toolkit; fault tolerance; group-based coordinated checkpointing; high performance computing; rollback recovery; scalability limitation; Bandwidth; Checkpointing; Computer networks; Computer science; Concurrent computing; Delay effects; Fault tolerance; High performance computing; Scalability; Sun;
Conference_Title :
Parallel Processing, 2007. ICPP 2007. International Conference on
Conference_Location :
Xi'an
Print_ISBN :
978-0-7695-2933-2
DOI :
10.1109/ICPP.2007.44