Title :
Efficient Process Replication for MPI Applications: Sharing Work between Replicas
Author :
Ropars, Thomas ; Lefray, Arnaud ; Kim, Dohyun ; Schiper, Andre
Author_Institution :
Ecole Polytech. Fed. de Lausanne (EPFL), Lausanne, Switzerland
Abstract :
With the increased failure rate expected in future extreme-scale supercomputers, process replication might become a viable alternative to checkpointing. By default, the workload efficiency of replication is limited to 50% because of the additional resources that have to be used to execute the replicas of the application's processes. In this paper, we introduce intra-parallelization, a solution that avoids replicating all computation by introducing work-sharing between replicas. We show on a representative set of benchmarks that intra-parallelization achieves more than 50% efficiency without compromising fault tolerance.
Keywords :
application program interfaces; checkpointing; fault tolerant computing; message passing; parallel processing; MPI applications; checkpointing; extreme scale supercomputers; failure rate; intraparallelization; process replication; Checkpointing; Computer crashes; Context; Fault tolerance; Fault tolerant systems; Kernel; Protocols; High performance computing; fault tolerance; replication;
Conference_Title :
Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International
Conference_Location :
Hyderabad
DOI :
10.1109/IPDPS.2015.29