DocumentCode :
2052227
Title :
Parallel I/O Performance for Application-Level Checkpointing on the Blue Gene/P System
Author :
Fu, Jing ; Min, Misun ; Latham, Robert ; Carothers, Christopher D.
Author_Institution :
Dept. of Comput. Sci., Rensselaer Polytech. Inst., Troy, NY, USA
fYear :
2011
fDate :
26-30 Sept. 2011
Firstpage :
465
Lastpage :
473
Abstract :
As the number of processors increases to hundreds of thousands in parallel computer architectures, the failure probability rises correspondingly, making fault tolerance a highly important and challenging task. Application-level checkpointing is one of the most popular techniques to proactively deal with unexpected failures because of its portability and flexibility. During the checkpoint phase, the local states of the computation spread across thousands of processors are saved to stable storage. Unfortunately, this approach results in heavy I/O load and can cause an I/O bottleneck in a massively parallel system. In this paper, we examine application-level checkpointing for a massively parallel electromagnetic solver system called NekCEM on the IBM Blue Gene/P at Argonne National Laboratory. We discuss an application-level, two-phase I/O approach, called "reduced-blocking I/O" (rbIO), and a tuned MPI-IO collective approach (coIO), and we demonstrate their performance advantage over the "1 POSIX file per processor" approach. Our study shows that rbIO and coIO result in 100× improvement over previous checkpointing approaches on up to 65,536 processors of the Blue Gene/P using the GPFS. Our study also demonstrates a 25× production performance improvement for NekCEM. We show how to optimize parameter settings for those parallel I/O approaches and how to verify results by I/O profiling. In particular, we examine the performance advantage of rbIO and demonstrate the potential benefits of this approach over the traditional MPI-IO routine, coIO.
Keywords :
checkpointing; fault tolerance; parallel architectures; parallel machines; performance evaluation; probability; I/O bottleneck; IBM Blue Gene/P system; MPI-IO collective approach; NekCEM; POSIX file per processor; application-level checkpointing; failure probability; fault tolerance; heavy I/O load; massively parallel electromagnetic solver system; parallel I/O performance; parallel computer architecture; reduced-blocking I/O; Bandwidth; Checkpointing; Computer architecture; Operating systems; Production; Program processors; Semantics; Blue Gene/P; Parallel I/O; checkpointing; fault tolerance;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Cluster Computing (CLUSTER), 2011 IEEE International Conference on
Conference_Location :
Austin, TX
Print_ISBN :
978-1-4577-1355-2
Electronic_ISBN :
978-0-7695-4516-5
Type :
conf
DOI :
10.1109/CLUSTER.2011.81
Filename :
6061135