DocumentCode :
560144
Title :
Evaluating the viability of process replication reliability for exascale systems
Author :
Ferreira, Kurt ; Stearley, Jon ; Laros, James H., III ; Oldfield, Ron ; Pedretti, Kevin ; Brightwell, Ron ; Riesen, Rolf ; Bridges, Patrick G. ; Arnold, Dorian
Author_Institution :
Scalable Syst. Software Dept., Sandia Nat. Labs., Albuquerque, NM, USA
fYear :
2011
fDate :
12-18 Nov. 2011
Firstpage :
1
Lastpage :
12
Abstract :
As high-end computing machines continue to grow in size, issues such as fault tolerance and reliability limit application scalability. Current techniques to ensure progress across faults, like checkpoint-restart, are increasingly problematic at these scales due to excessive overheads predicted to more than double an application's time to solution. Replicated computing techniques, particularly state machine replication, long used in distributed and mission-critical systems, have been suggested as an alternative to checkpoint-restart. In this paper, we evaluate the viability of using state machine replication as the primary fault tolerance mechanism for upcoming exascale systems. We use a combination of modeling, empirical analysis, and simulation to study the costs and benefits of this approach in comparison to checkpoint-restart on a wide range of system parameters. These results, which cover different failure distributions, hardware mean times to failure, and I/O bandwidths, show that state machine replication is a potentially useful technique for meeting the fault tolerance demands of HPC applications on future exascale platforms.
Keywords :
checkpointing; distributed processing; fault tolerant computing; finite state machines; HPC applications; checkpoint-restart; distributed systems; exascale systems; failure distribution; fault tolerance mechanism; high-end computing machines; mission critical systems; process replication reliability; replicated computing techniques; state machine replication; Bandwidth; Computer crashes; Fault tolerance; Fault tolerant systems; Hardware; Protocols; Sockets;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
High Performance Computing, Networking, Storage and Analysis (SC), 2011 International Conference for
Conference_Location :
Seattle, WA
Electronic_ISBN :
978-1-4503-0771-0
Type :
conf
Filename :
6114406