DocumentCode
2812209
Title
Analyzing Checkpointing Trends for Applications on the IBM Blue Gene/P System
Author
Naik, H. ; Gupta, R. ; Beckman, P.
Author_Institution
Math. & Comput. Sci. Div., Argonne Nat. Lab., Argonne, IL, USA
fYear
2009
fDate
22-25 Sept. 2009
Firstpage
81
Lastpage
88
Abstract
Current petascale systems have tens of thousands of hardware components and complex system software stacks, which increase the probability of faults occurring during the lifetime of a process. Checkpointing has been a popular method of providing fault tolerance in high-end systems. While considerable research has been done to optimize checkpointing, in practice the method still imposes a high overhead on users. In this paper, we study the checkpointing overhead seen by applications running on leadership-class machines such as the IBM Blue Gene/P at Argonne National Laboratory. We study various applications and design a methodology to assist users in understanding and choosing checkpointing frequency and reducing the overhead incurred. In particular, we study three popular applications (the Grid-Based Projector-Augmented Wave application, the Carr-Parrinello Molecular Dynamics application, and a Nek5000 computational fluid dynamics application) and analyze their memory usage and possible checkpointing trends on 32,768 processors of the Blue Gene/P system.
Keywords
checkpointing; computer architecture; fault tolerant computing; Carr-Parrinello molecular dynamics application; IBM Blue Gene/P System; Nek5000 computational fluid dynamics application; checkpointing trends analyzation; complex system software stack; fault tolerance computing; grid based projector augmented wave application; leadership class machine; petascale system; Application software; Checkpointing; Computational fluid dynamics; Design methodology; Fault tolerant systems; Frequency; Hardware; Laboratories; Optimization methods; System software; BG/P; Blue Gene; Checkpointing; Fault Tolerance; Full Checkpoint; Petascale;
fLanguage
English
Publisher
ieee
Conference_Titel
Parallel Processing Workshops, 2009. ICPPW '09. International Conference on
Conference_Location
Vienna
ISSN
1530-2016
Print_ISBN
978-1-4244-4923-1
Electronic_ISBN
1530-2016
Type
conf
DOI
10.1109/ICPPW.2009.42
Filename
5363066