• DocumentCode
    21918
  • Title
    Improving the Reliability of MPI Libraries via Message Flow Checking
  • Author
    Zhezhe Chen; Qi Gao; Wenbin Zhang; Feng Qin
  • Author_Institution
    Ohio State Univ., Columbus, OH, USA
  • Volume
    24
  • Issue
    3
  • fYear
    2013
  • fDate
    March 2013
  • Firstpage
    535
  • Lastpage
    549
  • Abstract
    Despite the success of the Message Passing Interface (MPI), many MPI libraries have suffered from software bugs. These bugs severely impact the productivity of a large number of users, causing program failures or other errors. As a result, MPI application developers often have to spend days or weeks in vain debugging their own code. To address this daunting problem, this paper presents a new method called FlowChecker, which detects communication-related bugs in MPI libraries. First, FlowChecker extracts program intentions of message passing (MP-intentions), which specify messages to be delivered from the sources to the destinations. Then FlowChecker tracks the message flows that actually occur in the underlying MPI libraries. Finally, FlowChecker checks whether the messages are correctly delivered from the sources to the destinations by comparing the message flows against the MP-intentions. If a mismatch is found, FlowChecker reports a bug and provides diagnostic information to help MPI library developers understand and fix it. We have built a FlowChecker prototype on Linux and evaluated it with five real-world and two injected bug cases in three widely used MPI libraries: Open MPI, MPICH2, and MVAPICH2. Our experimental results show that FlowChecker effectively detects all seven evaluated bug cases. Additionally, it provides useful diagnostic information for narrowing down or even pinpointing root causes of the bugs. Moreover, our experiments with High Performance Linpack and NAS Parallel Benchmarks show that FlowChecker induces low runtime overhead (0.9-5.6 percent on Open MPI, 0.9-8.1 percent on MPICH2, and 1.6-9.7 percent on MVAPICH2).
  • Keywords
    Linux; application program interfaces; message passing; program debugging; software libraries; software reliability; FlowChecker; High Performance Linpack; MP-intentions; MPI library reliability; MVAPICH2; NAS parallel benchmarks; Open MPI; message flow checking; message passing interface; message passing program intentions; software bugs; Computer bugs; Libraries; Message passing; Runtime; Semantics; Software; Tracking; Software reliability; bug detection; message passing interfaces
  • fLanguage
    English
  • Journal_Title
    Parallel and Distributed Systems, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1045-9219
  • Type
    jour
  • DOI
    10.1109/TPDS.2012.127
  • Filename
    6416896