• DocumentCode
    704127
  • Title

    NanoCheckpoints: A Task-Based Asynchronous Dataflow Framework for Efficient and Scalable Checkpoint/Restart

  • Author

    Subasi, Omer ; Arias, Javier ; Unsal, Osman ; Labarta, Jesus ; Cristal, Adrian

  • Author_Institution
    Barcelona Supercomput. Center, Polytech. Univ. of Catalonia, Barcelona, Spain
  • fYear
    2015
  • fDate
    4-6 March 2015
  • Firstpage
    99
  • Lastpage
    102
  • Abstract
    In this paper, we present NanoCheckpoints, a lightweight software-based checkpoint/restart scheme for task-parallel HPC applications. We leverage OmpSs, a task-based OpenMP derivative programming model (PM), and its Nanos asynchronous dataflow runtime. NanoCheckpoints achieves minimal overhead by checkpointing only tasks' inputs, which are available for free in the OmpSs PM. We evaluate NanoCheckpoints with both pure task-parallel shared memory benchmarks (up to 16 cores) and hybrid OmpSs+MPI applications (up to 1024 cores). The results indicate that NanoCheckpoints incurs an average overhead of 3% for shared memory benchmarks. The dataflow semantics of Nanos, where both checkpointing and error recovery are asynchronous, allow NanoCheckpoints to scale to large core counts even when high error rates are present. For hybrid OmpSs+MPI benchmarks, NanoCheckpoints has very low overhead, on average 2%, and high scalability.
  • Keywords
    checkpointing; data flow computing; shared memory systems; NanoCheckpoints; Nanos asynchronous dataflow runtime; OmpSs PM; dataflow semantics; error recovery; hybrid OmpSs+MPI benchmarks; minimal overheads; pure task-parallel shared memory benchmarks; software-based checkpoint scheme; software-based restart scheme; task-based OpenMP derivative programming model; task-based asynchronous dataflow framework; task-parallel HPC applications; Arrays; Benchmark testing; Checkpointing; Instruction sets; Reliability; Runtime; Scalability; Checkpoint/restart; Dataflow; Task parallelism;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel, Distributed and Network-Based Processing (PDP), 2015 23rd Euromicro International Conference on
  • Conference_Location
    Turku
  • ISSN
    1066-6192
  • Type

    conf

  • DOI
    10.1109/PDP.2015.17
  • Filename
    7092706