• DocumentCode
    2798331
  • Title

    Design and Implementation of Multiple Fault-Tolerant MPI over Myrinet (M^3)

  • Author

    Jung, Hyungsoo ; Shin, Dongin ; Han, Hyuck ; Kim, Jai W. ; Yeom, Heon Y. ; Lee, Jongsuk

  • Author_Institution
    Seoul National University
  • fYear
    2005
  • fDate
    12-18 Nov. 2005
  • Firstpage
    32
  • Lastpage
    32
  • Abstract
    Advances in network technology and computing power have inspired the emergence of high-performance cluster computing systems. While cluster management and hardware high-availability tools are readily available, practical and easily deployable fault-tolerant systems have not been successfully adopted commercially. We present a fault-tolerant system, Multiple fault-tolerant MPI over Myrinet (M^3), that differs in notable respects from other proposed fault-tolerant systems in the literature. M^3 is built on top of Myrinet since it is regarded as one of the best solutions for high-performance networks and is widely used in cluster computing systems because it can provide a high-speed switching network that is an inevitable ingredient in interconnecting clusters of workstations or PCs. M^3 is a user-transparent checkpointing system for multiple fault-tolerant MPI implementation that is primarily based on the coordinated checkpointing protocol. M^3 supports three critical functionalities that are necessary for fault tolerance: a light-weight failure detection mechanism, dynamic process management that includes process migration, and a consistent checkpoint and recovery mechanism. The features of M^3 are that it requires no modifications of application code and that it preserves much of the high-performance characteristics of Myrinet. This paper describes the architecture of M^3, its detailed design principles and comprehensive implementation issues. We also propose practical solutions for those involved in constructing highly available cluster systems for parallel programming systems. Experimental results substantiate our assertion that M^3 can be a good candidate for practically deployable fault-tolerant systems in very large, high-performance Myrinet clusters and that its protocol can be applied to a wide variety of parallel communication libraries without difficulty.
  • Keywords
    Checkpointing; Computer networks; Disaster management; Fault tolerance; Fault tolerant systems; Hardware; Personal communication networks; Power system management; Protocols; Workstations;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Supercomputing, 2005. Proceedings of the ACM/IEEE SC 2005 Conference
  • Print_ISBN
    1-59593-061-2
  • Type

    conf

  • DOI
    10.1109/SC.2005.22
  • Filename
    1559984