• DocumentCode
    3425765
  • Title

    Blocking vs. Non-Blocking Coordinated Checkpointing for Large-Scale Fault Tolerant MPI

  • Author

    Coti, Camille ; Herault, Thomas ; Lemarinier, Pierre ; Pilard, Laurence ; Rezmerita, Ala ; Rodriguez, Eric ; Cappello, Franck

  • Author_Institution
    Lab. de Recherche en Informatique, Univ. Paris-XI
  • fYear
    2006
  • fDate
    Nov. 2006
  • Firstpage
    18
  • Lastpage
    18
  • Abstract
    A long-term trend in high-performance computing is the increasing number of nodes in parallel computing platforms, which entails a higher failure probability. Fault tolerant programming environments should be used to guarantee the safe execution of critical applications. Research in fault tolerant MPI has led to the development of several fault tolerant MPI environments. Different approaches have been proposed, using a variety of fault tolerant message passing protocols based on coordinated checkpointing or message logging. The most popular approach is coordinated checkpointing. In the literature, two different concepts of coordinated checkpointing have been proposed: blocking and non-blocking. However, they have never been compared quantitatively, and their respective scalability remains unknown. The contribution of this paper is to provide the first comparison between these two approaches and a study of their scalability. We have implemented the two approaches within the MPICH environment and evaluated their performance using the NAS Parallel Benchmarks.
  • Keywords
    application program interfaces; checkpointing; message passing; parallel machines; software fault tolerance; high performance computing; large-scale fault tolerant MPI; message passing protocol; nonblock coordinated checkpoint; parallel computing platform; Application software; Checkpointing; Concurrent computing; Fault tolerance; Large-scale systems; Message passing; Programming environments; Protocols; Scalability; Supercomputers
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    SC 2006 Conference, Proceedings of the ACM/IEEE
  • Conference_Location
    Tampa, FL
  • Print_ISBN
    0-7695-2700-0
  • Electronic_ISBN
    0-7695-2700-0
  • Type

    conf

  • DOI
    10.1109/SC.2006.15
  • Filename
    4090192