• DocumentCode
    625574
  • Title

    Adaptive Incremental Checkpointing via Delta Compression for Networked Multicore Systems

  • Author

    Jangjaimon, Itthichok; Tzeng, Nian-Feng

  • Author_Institution
    Center for Adv. Comput. Studies, Univ. of Louisiana at Lafayette, Lafayette, LA, USA
  • fYear
    2013
  • fDate
    20-24 May 2013
  • Firstpage
    7
  • Lastpage
    18
  • Abstract
    Checkpointing has been widely adopted in support of fault tolerance and job migration, with checkpoint files preferably also kept at remote storage to withstand unavailability or failures of local nodes in networked systems. Lately, I/O bandwidth to remote storage has become the bottleneck for checkpointing on large-scale systems. This paper proposes adaptive incremental checkpointing (AIC), which aims to reduce the checkpoint file size considerably so that the checkpointing overhead is lowered and the expected job turnaround time drops. Given that production multicore systems are often observed to have unused cores available, we design AIC to use separate cores for carrying out multi-level checkpointing with delta compression, adaptively, at desirable points in time. We develop a new Markov model for predicting the performance of such multi-level concurrent checkpointing, and evaluate AIC performance using six SPEC benchmarks under various system sizes. AIC is observed to lower the normalized expected turnaround time substantially (by up to 47%) when compared to its static counterpart and to a recent multi-level checkpointing scheme with fixed checkpoint intervals.
  • Keywords
    Markov processes; checkpointing; data compression; fault tolerant computing; multiprocessing systems; AIC performance; IO bandwidth; Markov model; SPEC benchmarks; adaptive incremental checkpointing; delta compression; fault-tolerance; fixed checkpoint intervals; job migration; large-scale system; multilevel checkpointing scheme; networked multicore systems; production multicore systems; remote storage; Bandwidth; Benchmark testing; Multicore processing; Numerical models; Runtime; Adaptive checkpointing; fault tolerance; incremental checkpointing; multicore systems; two-level checkpointing
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on
  • Conference_Location
    Boston, MA
  • ISSN
    1530-2075
  • Print_ISBN
    978-1-4673-6066-1
  • Type

    conf

  • DOI
    10.1109/IPDPS.2013.33
  • Filename
    6569796