• DocumentCode
    2484196
  • Title

    DMTCP: Transparent checkpointing for cluster computations and the desktop

  • Author

    Ansel, Jason ; Aryay, Kapil ; Coopermany, Gene

  • Author_Institution
    Comput. Sci. & Artificial Intell. Lab., Massachusetts Inst. of Technol., Cambridge, MA, USA
  • fYear
    2009
  • fDate
    23-29 May 2009
  • Firstpage
    1
  • Lastpage
    12
  • Abstract
    DMTCP (distributed multithreaded checkpointing) is a transparent user-level checkpointing package for distributed applications. Checkpointing and restart is demonstrated for a wide range of over 20 well known applications, including MATLAB, Python, TightVNC, MPICH2, OpenMPI, and runCMS. RunCMS runs as a 680 MB image in memory that includes 540 dynamic libraries, and is used for the CMS experiment of the Large Hadron Collider at CERN. DMTCP transparently checkpoints general cluster computations consisting of many nodes, processes, and threads; as well as typical desktop applications. On 128 distributed cores (32 nodes), checkpoint and restart times are typically 2 seconds, with negligible run-time overhead. Typical checkpoint times are reduced to 0.2 seconds when using forked checkpointing. Experimental results show that checkpoint time remains nearly constant as the number of nodes increases on a medium-size cluster. DMTCP automatically accounts for fork, exec, ssh, mutexes/ semaphores, TCP/IP sockets, UNIX domain sockets, pipes, ptys (pseudo-terminals), terminal modes, ownership of controlling terminals, signal handlers, open file descriptors, shared open file descriptors, I/O (including the readline library), shared memory (via mmap), parent-child process relationships, pid virtualization, and other operating system artifacts. By emphasizing an unprivileged, user-space approach, compatibility is maintained across Linux kernels from 2.6.9 through the current 2.6.28. Since DMTCP is unprivileged and does not require special kernel modules or kernel patches, DMTCP can be incorporated and distributed as a checkpoint-restart module within some larger package.
  • Keywords
    Linux; checkpointing; multi-threading; operating system kernels; Linux kernels; MATLAB; MPICH2; OpenMPI; Python; TightVNC; checkpoint times; checkpoint-restart module; checkpointing packages; cluster computations; distributed cores; distributed multithreaded checkpointing; forked checkpointing; open file descriptors; parent-child process relationships; pid virtualization; restart times; runCMS; shared memory; shared open file descriptors; transparent user-level checkpointing; Checkpointing; Collision mitigation; Kernel; Large Hadron Collider; Libraries; MATLAB; Packaging; Runtime; Sockets; Yarn;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on
  • Conference_Location
    Rome
  • ISSN
    1530-2075
  • Print_ISBN
    978-1-4244-3751-1
  • Electronic_ISBN
    1530-2075
  • Type

    conf

  • DOI
    10.1109/IPDPS.2009.5161063
  • Filename
    5161063