• DocumentCode
    1854115
  • Title

    A Fast-Start, Fault-Tolerant MPI Launcher on Dawning Supercomputers

  • Author

    Liu, Xu ; Tu, Bibo ; Zhan, Jianfeng ; Meng, Dan

  • Author_Institution
    Nat. Res. Center for Intell. Comput. Syst., Chinese Acad. of Sci., Beijing
  • fYear
    2008
  • fDate
    1-4 Dec. 2008
  • Firstpage
    263
  • Lastpage
    266
  • Abstract
    Daemon-based MPI launchers are now mainstream because they can start processes rapidly. However, effective task management and fault tolerance become more important as supercomputers grow in scale. A new fast-start, fault-tolerant launcher called SFLauncher has been used to start MPICH tasks on Dawning supercomputers. This paper details its features and implementation, with emphasis on scalability, the self-organization algorithm, and garbage reclamation. Performance evaluation results for SFLauncher are also given.
  • Keywords
    application program interfaces; fault tolerance; message passing; parallel machines; Dawning supercomputer; fast-start fault-tolerant MPI launcher; message passing interface; task management; Computer crashes; Distributed computing; Fault tolerance; Fault tolerant systems; Heart beat; Intelligent systems; Partitioning algorithms; Peer to peer computing; Scalability; Supercomputers; MPI launcher; fault tolerance; scalability
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Computing, Applications and Technologies, 2008. PDCAT 2008. Ninth International Conference on
  • Conference_Location
    Otago
  • Print_ISBN
    978-0-7695-3443-5
  • Type

    conf

  • DOI
    10.1109/PDCAT.2008.56
  • Filename
    4710990