• DocumentCode
    560175
  • Title
    FTI: High performance Fault Tolerance Interface for hybrid systems
  • Author
    Bautista-Gomez, Leonardo ; Komatitsch, Dimitri ; Maruyama, Naoya ; Tsuboi, Seiji ; Cappello, Franck ; Matsuoka, Satoshi
  • Author_Institution
    Tokyo Inst. of Technol., Tokyo, Japan
  • fYear
    2011
  • fDate
    12-18 Nov. 2011
  • Firstpage
    1
  • Lastpage
    12
  • Abstract
    Large scientific applications deployed on current petascale systems expend a significant amount of their execution time dumping checkpoint files to remote storage. New fault-tolerant techniques will be critical to efficiently exploit post-petascale systems. In this work, we propose a low-overhead high-frequency multi-level checkpoint technique in which we integrate a highly reliable topology-aware Reed-Solomon encoding in a three-level checkpoint scheme. We efficiently hide the encoding time using one fault-tolerance dedicated thread per node. We implement our technique in the Fault Tolerance Interface (FTI). We evaluate the correctness of our performance model and conduct a study of the reliability of our library. To demonstrate the performance of FTI, we present a case study of the Mw9.0 Tohoku Japan earthquake simulation with SPECFEM3D on TSUBAME2.0. We demonstrate a checkpoint overhead as low as 8% on sustained 0.1 petaflops runs (1152 GPUs) while checkpointing at high frequency.
  • Keywords
    Reed-Solomon codes; checkpointing; earthquakes; fault tolerant computing; geophysics computing; graphics processing units; mainframes; topology; user interfaces; Mw9.0 Tohoku Japan earthquake simulation; SPECFEM3D; Tsubame2.0; fault tolerance interface; hybrid system; low-overhead high-frequency multilevel checkpoint technique; petascale system; three-level checkpoint scheme; topology-aware Reed-Solomon encoding; Computational modeling; Encoding; Fault tolerance; Fault tolerant systems; Libraries; Writing;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    High Performance Computing, Networking, Storage and Analysis (SC), 2011 International Conference for
  • Conference_Location
    Seattle, WA
  • Electronic_ISBN
    978-1-4503-0771-0
  • Type
    conf
  • Filename
    6114441