DocumentCode :
1858756
Title :
On Communication Determinism in Parallel HPC Applications
Author :
Cappello, Franck ; Guermouche, Amina ; Snir, Marc
Author_Institution :
INRIA Saclay-Ile de France, Orsay, France
fYear :
2010
fDate :
2-5 Aug. 2010
Firstpage :
1
Lastpage :
8
Abstract :
Current fault-tolerant protocols for parallel high-performance computing (HPC) applications have two major drawbacks: they either require restarting all processes even when only a single process fails, or incur a high performance overhead in fault-free execution. As a consequence, none of the existing generic fault-tolerant protocols matches the needs of HPC applications and, surprisingly, there is no fault-tolerant protocol dedicated to them. One way to design better fault-tolerant protocols for HPC applications is to explore and take advantage of their specific characteristics. In particular, we suspect that most of them exhibit some form of determinism in their communication patterns. Communication determinism can play an important role in the design of new fault-tolerant protocols by reducing their complexity. In this paper, we explore communication determinism in 27 parallel HPC applications that are representative of production workloads in large-scale centers. We show that most of these applications have deterministic or send-deterministic communication patterns.
Keywords :
fault tolerant computing; protocols; communication determinism; fault tolerant protocols; high performance computing parallel applications; parallel HPC applications; send-deterministic communication patterns; Benchmark testing; Computational modeling; Fault tolerance; Fault tolerant systems; Kernel; Physics; Protocols;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computer Communications and Networks (ICCCN), 2010 Proceedings of 19th International Conference on
Conference_Location :
Zurich
ISSN :
1095-2055
Print_ISBN :
978-1-4244-7114-0
Type :
conf
DOI :
10.1109/ICCCN.2010.5560143
Filename :
5560143