DocumentCode :
3200143
Title :
Efficient Process Replication for MPI Applications: Sharing Work between Replicas
Author :
Ropars, Thomas ; Lefray, Arnaud ; Kim, Dohyun ; Schiper, Andre
Author_Institution :
École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
fYear :
2015
fDate :
25-29 May 2015
Firstpage :
645
Lastpage :
654
Abstract :
With the increased failure rate expected in future extreme scale supercomputers, process replication might become a viable alternative to checkpointing. By default, the workload efficiency of replication is limited to 50% because of the additional resources that have to be used to execute the replicas of the application's processes. In this paper, we introduce intra-parallelization, a solution that avoids replicating all computation by introducing work-sharing between replicas. We show on a representative set of benchmarks that intra-parallelization allows achieving more than 50% efficiency without compromising fault tolerance.
Keywords :
application program interfaces; checkpointing; fault tolerant computing; message passing; parallel processing; MPI applications; checkpointing; extreme scale supercomputers; failure rate; intraparallelization; process replication; Checkpointing; Computer crashes; Context; Fault tolerance; Fault tolerant systems; Kernel; Protocols; High performance computing; fault tolerance; replication;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International
Conference_Location :
Hyderabad
ISSN :
1530-2075
Type :
conf
DOI :
10.1109/IPDPS.2015.29
Filename :
7161552