DocumentCode :
3543274
Title :
File I/O for MPI Applications in Redundant Execution Scenarios
Author :
Böhm, Swen ; Engelmann, Christian
Author_Institution :
Comput. Sci. & Math. Div., Oak Ridge Nat. Lab., Oak Ridge, TN, USA
fYear :
2012
fDate :
15-17 Feb. 2012
Firstpage :
112
Lastpage :
119
Abstract :
As multi-petascale and exa-scale high-performance computing (HPC) systems inevitably have to deal with a number of resilience challenges, such as a significant growth in component count and smaller circuit sizes with lower circuit voltages, redundancy may offer an acceptable level of resilience that traditional fault tolerance techniques, such as checkpoint/restart, do not. Although redundancy in HPC is quite controversial due to the associated cost for redundant components, the constantly increasing number of cores per processor is tilting this cost calculation toward a system design where computation, such as for redundancy, is much cheaper and communication, needed for checkpoint/restart, is much more expensive. Recent research and development activities in redundancy for Message Passing Interface (MPI) applications focused on availability/reliability models and replication algorithms. This paper takes a first step toward solving an open research problem associated with running a parallel application redundantly, which is file I/O under redundancy. The approach intercepts file I/O calls made by a redundant application to employ coordination protocols that execute file I/O operations in a redundancy-oblivious fashion when accessing a node-local file system, or in a redundancy-aware fashion when accessing a shared networked file system. A proof-of-concept prototype is presented and a number of coordination protocols are described and evaluated. The results show the performance impact for redundantly accessing a shared networked file system, but also demonstrate the capability to regain performance by utilizing MPI communication between replicas and parallel file I/O.
Keywords :
fault tolerant computing; file organisation; message passing; parallel processing; shared memory systems; MPI; availability model; coordination protocols; exa-scale HPC; fault tolerance; file I/O; high-performance computing; message passing interface; multi-petascale HPC; node-local file system; parallel application; redundant execution scenarios; reliability model; replication algorithms; shared networked file system; Fault tolerant systems; Libraries; Protocols; Prototypes; Redundancy; Resilience; Message Passing Interface; fault tolerance; high-performance computing; redundancy; resilience;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Parallel, Distributed and Network-Based Processing (PDP), 2012 20th Euromicro International Conference on
Conference_Location :
Garching
ISSN :
1066-6192
Print_ISBN :
978-1-4673-0226-5
Type :
conf
DOI :
10.1109/PDP.2012.22
Filename :
6169537