• DocumentCode
    656235
  • Title
    Toward a Performance/Resilience Tool for Hardware/Software Co-design of High-Performance Computing Systems
  • Author
    Engelmann, Christian ; Naughton, Thomas
  • Author_Institution
    Comput. Sci. & Math. Div., Oak Ridge Nat. Lab., Oak Ridge, TN, USA
  • fYear
    2013
  • fDate
    1-4 Oct. 2013
  • Firstpage
    960
  • Lastpage
    969
  • Abstract
    xSim is a simulation-based performance investigation toolkit that permits running high-performance computing (HPC) applications in a controlled environment with millions of concurrent execution threads, while observing application performance in a simulated extreme-scale system for hardware/software co-design. The presented work details newly developed features for xSim that permit the injection of MPI process failures, the propagation/detection/notification of such failures within the simulation, and their handling using application-level checkpoint/restart. These new capabilities enable the observation of application behavior and performance under failure within a simulated future-generation HPC system using the most common fault handling technique.
  • Keywords
    application program interfaces; checkpointing; concurrency control; hardware-software codesign; message passing; multi-threading; parallel processing; HPC; MPI process failures; application performance; application-level checkpoint; application-level restart; concurrent execution threads; failure detection; failure notification; failure propagation; fault handling technique; high-performance computing systems; performance tool; resilience tool; simulated extreme-scale system; simulation-based performance investigation toolkit; xSim; Computational modeling; Computer architecture; Hardware; Power demand; Reliability; Resilience; Software; Fault Injection; High-performance Computing; Message Passing Interface; Parallel Discrete Event Simulation
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Parallel Processing (ICPP), 2013 42nd International Conference on
  • Conference_Location
    Lyon
  • ISSN
    0190-3918
  • Type
    conf
  • DOI
    10.1109/ICPP.2013.114
  • Filename
    6687439