DocumentCode :
3588932
Title :
MPI Runtime Error Detection with MUST: A Scalable and Crash-Safe Approach
Author :
Protze, Joachim ; Hilbrich, Tobias ; Schulz, Martin ; de Supinski, Bronis R. ; Nagel, Wolfgang E. ; Mueller, Matthias S.
Author_Institution :
RWTH Aachen Univ., Aachen, Germany
fYear :
2014
Firstpage :
206
Lastpage :
215
Abstract :
The Message Passing Interface (MPI) is a widely used paradigm for distributed memory programming. Implementations of this interface are designed for good performance rather than for usability extensions that enforce their correct use. Runtime MPI usage error detection tools aid application developers in the correct use of this interface. Since usage errors can cause failures that lead to an application crash, it is crucial that runtime error detection tools employ techniques that allow them to finish all of their correctness checks. This includes situations in which the application is interrupted by the MPI library, due to an incorrect function call, or by operating system signals after fatal errors like division by zero or faulty memory accesses. We present an approach that uses an alternative means of tool communication along with signal and error handling capabilities. A study of the assumptions that enable this approach details its applicability for different use cases and compares it to less efficient schemes that rely on synchronous processing and/or communication. Additionally, we enable bandwidth-efficient communication with a scalable propagation technique that raises the awareness of an application crash within the tool. An application study with the SPEC MPI2007 benchmark suite demonstrates the applicability of our approach for up to 2,048 processes. Overhead measurements underline that our application crash handling increases the runtime of our runtime error detection tool by only 4% on average.
Keywords :
application program interfaces; distributed memory systems; distributed programming; error detection; error handling; message passing; storage management; system recovery; MPI library; MUST; SPEC MPI2007 benchmark suite; application crash handling; application development; bandwidth efficient communication; correctness check; crash-safe approach; distributed memory programming; error handling; fatal error; faulty memory access; incorrect function call; message passing interface; operating system signal; overhead measurement; runtime MPI usage error detection tool; scalable propagation technique; signal handling; synchronous processing; usability extension; Bandwidth; Computer crashes; Libraries; Operating systems; Protocols; Runtime; System recovery; MPI; crash safe; debugging; detection
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Parallel Processing Workshops (ICPPW), 2014 43rd International Conference on
ISSN :
1530-2016
Type :
conf
DOI :
10.1109/ICPPW.2014.37
Filename :
7103455