• DocumentCode
    3428306
  • Title

    MCREngine: A scalable checkpointing system using data-aware aggregation and compression

  • Author

    Islam, Tanzima Zerin ; Mohror, Kathryn ; Bagchi, Saurabh ; Moody, Adam ; de Supinski, Bronis R. ; Eigenmann, Rudi

  • Author_Institution
    Sch. of Electr. & Comput. Eng., Purdue Univ., West Lafayette, IN, USA
  • fYear
    2012
  • fDate
    10-16 Nov. 2012
  • Firstpage
    1
  • Lastpage
    11
  • Abstract
    High performance computing (HPC) systems use checkpoint-restart to tolerate failures. Typically, applications store their states in checkpoints on a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high overheads due to contention for PFS resources. The high overheads force large-scale applications to reduce checkpoint frequency, which means more compute time is lost in the event of failure. We alleviate this problem through a scalable checkpoint-restart system, MCRENGINE. MCRENGINE aggregates checkpoints from multiple application processes with knowledge of the data semantics available through widely-used I/O libraries, e.g., HDF5 and netCDF, and compresses them. Our novel scheme improves compressibility of checkpoints by up to 115% over simple concatenation and compression. Our evaluation with large-scale application checkpoints shows that MCRENGINE reduces checkpointing overhead by up to 87% and restart overhead by up to 62% over a baseline with no aggregation or compression.
  • Keywords
    checkpointing; data compression; parallel processing; MCREngine; PFS; checkpoint frequency; data semantics; data-aware aggregation; high performance computing systems; large-scale application checkpoints; parallel file system; scalable checkpointing system; Arrays; Checkpointing; Computer numerical control; Libraries; Message systems; Reactive power; Transceivers
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for
  • Conference_Location
    Salt Lake City, UT
  • ISSN
    2167-4329
  • Print_ISBN
    978-1-4673-0805-2
  • Type

    conf

  • DOI
    10.1109/SC.2012.77
  • Filename
    6468462
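
  • Illustration
    The abstract contrasts data-aware aggregation with simple concatenation followed by compression. The sketch below is only a toy illustration of that contrast, not the MCRENGINE implementation: the checkpoint layout, the variable names (`temperature`, `step`), the helper functions, and the choice of zlib as the compressor are all assumptions made for the example. It merges per-process checkpoints in two ways, process by process and variable by variable, and compresses each result. Whether grouping by variable compresses better depends on the data, so no sizes are claimed here.

```python
import struct
import zlib


def make_checkpoint(rank, n=2000):
    """Build a hypothetical per-process checkpoint: variable name -> raw bytes."""
    return {
        "temperature": struct.pack(f"{n}d", *([300.0 + rank] * n)),
        "step": struct.pack(f"{n}i", *range(n)),
    }


def naive_aggregate(checkpoints):
    """Concatenate whole checkpoints one process after another."""
    return b"".join(
        b"".join(ckpt[name] for name in sorted(ckpt)) for ckpt in checkpoints
    )


def data_aware_aggregate(checkpoints):
    """Place the same variable from every process next to each other."""
    names = sorted(checkpoints[0])
    return b"".join(
        b"".join(ckpt[name] for ckpt in checkpoints) for name in names
    )


checkpoints = [make_checkpoint(rank) for rank in range(4)]
naive_blob = naive_aggregate(checkpoints)
aware_blob = data_aware_aggregate(checkpoints)

naive_compressed = zlib.compress(naive_blob)
aware_compressed = zlib.compress(aware_blob)

print("uncompressed bytes:", len(naive_blob))
print("naive compressed bytes:", len(naive_compressed))
print("data-aware compressed bytes:", len(aware_compressed))
```

    Both orderings hold the same bytes, so the uncompressed sizes match; only the layout seen by the compressor differs. A real system would read variable boundaries from the I/O library metadata (for example HDF5 or netCDF) rather than from a Python dictionary.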