DocumentCode
87833
Title
Diagnosis of Performance Faults in Large-Scale MPI Applications via Probabilistic Progress-Dependence Inference
Author
Laguna, Ignacio ; Ahn, Dong H. ; de Supinski, Bronis R. ; Bagchi, Saurabh ; Gamblin, Todd
Author_Institution
Lawrence Livermore Nat. Lab., Livermore, CA, USA
Volume
26
Issue
5
fYear
2015
fDate
May 1 2015
Firstpage
1280
Lastpage
1289
Abstract
Debugging large-scale parallel applications is challenging. Most existing techniques provide little information about failure root causes. Further, most debuggers significantly slow down program execution, and run sluggishly with massively parallel applications. This paper presents a novel technique that scalably infers the tasks in a parallel program on which a failure occurred, as well as the code in which it originated. Our technique combines scalable runtime analysis with static analysis to determine the least-progressed task(s) and to identify the code lines at which the failure arose. We present a novel algorithm that infers probabilistically progress dependence among MPI tasks using a globally constructed Markov model that represents tasks' control-flow behavior. In comparison to previous work, our algorithm infers more precisely the least-progressed task. We combine this technique with static backward slicing analysis, further isolating the code responsible for the current state. A blind study demonstrates that our technique isolates the root cause of a concurrency bug in a molecular dynamics simulation, which only manifests itself at 7,996 tasks or more. We extensively evaluate fault coverage of our technique via fault injections in 10 HPC benchmarks and show that our analysis takes less than a few seconds on thousands of parallel tasks.
Keywords
Markov processes; application program interfaces; concurrency control; fault diagnosis; message passing; parallel programming; probability; program debugging; program slicing; software performance evaluation; HPC benchmarks; MPI applications; Markov model; code isolation; code line identification; concurrency bug; failure root causes; large-scale parallel application debugging; molecular dynamics simulation; parallel program; parallel tasks; performance fault diagnosis; probabilistic progress-dependence inference; program execution; scalable runtime analysis; static analysis; static backward slicing analysis; task control-flow behavior; Algorithm design and analysis; Benchmark testing; Computational modeling; Debugging; Handheld computers; Markov processes; Probabilistic logic; Distributed debugging; MPI; parallel applications; progress dependence;
fLanguage
English
Journal_Title
Parallel and Distributed Systems, IEEE Transactions on
Publisher
IEEE
ISSN
1045-9219
Type
jour
DOI
10.1109/TPDS.2014.2314100
Filename
6803050