DocumentCode :
656235
Title :
Toward a Performance/Resilience Tool for Hardware/Software Co-design of High-Performance Computing Systems
Author :
Engelmann, Christian ; Naughton, Thomas
Author_Institution :
Comput. Sci. & Math. Div., Oak Ridge Nat. Lab., Oak Ridge, TN, USA
fYear :
2013
fDate :
1-4 Oct. 2013
Firstpage :
960
Lastpage :
969
Abstract :
xSim is a simulation-based performance investigation toolkit that permits running high-performance computing (HPC) applications in a controlled environment with millions of concurrent execution threads, while observing application performance in a simulated extreme-scale system for hardware/software co-design. The presented work details newly developed features for xSim that permit the injection of MPI process failures, the propagation/detection/notification of such failures within the simulation, and their handling using application-level checkpoint/restart. These new capabilities enable the observation of application behavior and performance under failure within a simulated future-generation HPC system using the most common fault handling technique.
Keywords :
application program interfaces; checkpointing; concurrency control; hardware-software codesign; message passing; multi-threading; parallel processing; HPC; MPI process failures; application performance; application-level checkpoint; application-level restart; concurrent execution threads; failure detection; failure notification; failure propagation; fault handling technique; high-performance computing systems; performance tool; resilience tool; simulated extreme-scale system; simulation-based performance investigation toolkit; xSim; Computational modeling; Computer architecture; Hardware; Power demand; Reliability; Resilience; Software; Fault Injection; High-performance Computing; Message Passing Interface; Parallel Discrete Event Simulation
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Parallel Processing (ICPP), 2013 42nd International Conference on
Conference_Location :
Lyon
ISSN :
0190-3918
Type :
conf
DOI :
10.1109/ICPP.2013.114
Filename :
6687439