• DocumentCode
    2812209
  • Title

    Analyzing Checkpointing Trends for Applications on the IBM Blue Gene/P System

  • Author

    Naik, H. ; Gupta, R. ; Beckman, P.

  • Author_Institution
    Math. & Comput. Sci. Div., Argonne Nat. Lab., Argonne, IL, USA
  • fYear
    2009
  • fDate
    22-25 Sept. 2009
  • Firstpage
    81
  • Lastpage
    88
  • Abstract
    Current petascale systems have tens of thousands of hardware components and complex system software stacks, which increase the probability of faults occurring during the lifetime of a process. Checkpointing has been a popular method of providing fault tolerance in high-end systems. While considerable research has been done to optimize checkpointing, in practice the method still imposes a high overhead on users. In this paper, we study the checkpointing overhead seen by applications running on leadership-class machines such as the IBM Blue Gene/P at Argonne National Laboratory. We study various applications and design a methodology to assist users in understanding and choosing a checkpointing frequency and in reducing the overhead incurred. In particular, we study three popular applications (the Grid-Based Projector-Augmented Wave application, the Carr-Parrinello Molecular Dynamics application, and a Nek5000 computational fluid dynamics application) and analyze their memory usage and possible checkpointing trends on 32,768 processors of the Blue Gene/P system.
  • Keywords
    checkpointing; computer architecture; fault tolerant computing; Carr-Parrinello molecular dynamics application; IBM Blue Gene/P System; Nek5000 computational fluid dynamics application; checkpointing trends analyzation; complex system software stack; fault tolerance computing; grid based projector augmented wave application; leadership class machine; petascale system; Application software; Checkpointing; Computational fluid dynamics; Design methodology; Fault tolerant systems; Frequency; Hardware; Laboratories; Optimization methods; System software; BG/P; Blue Gene; Checkpointing; Fault Tolerance; Full Checkpoint; Petascale;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Parallel Processing Workshops, 2009. ICPPW '09. International Conference on
  • Conference_Location
    Vienna
  • ISSN
    1530-2016
  • Print_ISBN
    978-1-4244-4923-1
  • Electronic_ISBN
    1530-2016
  • Type

    conf

  • DOI
    10.1109/ICPPW.2009.42
  • Filename
    5363066