DocumentCode :
1907958
Title :
Modeling the Impact of Checkpoints on Next-Generation Systems
Author :
Oldfield, Ron A. ; Arunagiri, Sarala ; Teller, Patricia J. ; Seelam, Seetharami ; Varela, Maria Ruiz ; Riesen, Rolf ; Roth, Philip C.
Author_Institution :
Sandia Nat. Labs, Livermore
fYear :
2007
fDate :
24-27 Sept. 2007
Firstpage :
30
Lastpage :
46
Abstract :
The next generation of capability-class, massively parallel processing (MPP) systems is expected to have hundreds of thousands of processors. For application-driven, periodic checkpoint operations, the state-of-the-art does not provide a solution that scales to next-generation systems. We demonstrate this by using mathematical modeling to compute a lower bound of the impact of these approaches on the performance of applications executed on three massive-scale, in-production, DOE systems and a theoretical petaflop system. We also adapt the model to investigate a proposed optimization that makes use of "lightweight" storage architectures and overlay networks to overcome the storage system bottleneck. Our results indicate that (1) as we approach the scale of next-generation systems, traditional checkpoint/restart approaches will increasingly impact application performance, accounting for over 50% of total application execution time; (2) although our alternative approach improves performance, it has limitations of its own; and (3) there is a critical need for new approaches to fault tolerance that allow continuous computing with minimal impact on application scalability.
Keywords :
checkpointing; memory architecture; parallel processing; application-driven periodic checkpoint operations; capability-class MPP systems; lightweight storage architectures; massive-scale in-production DOE systems; massively parallel processing systems; mathematical modeling; next-generation systems; overlay networks; petaflop system; Bandwidth; Computer networks; Contracts; Delay; Fault tolerance; Fault tolerant systems; Laboratories; Large-scale systems; Parallel processing; US Department of Energy;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Mass Storage Systems and Technologies, 2007. MSST 2007. 24th IEEE Conference on
Conference_Location :
San Diego, CA
Print_ISBN :
978-0-7695-3025-3
Type :
conf
DOI :
10.1109/MSST.2007.4367962
Filename :
4367962