  • DocumentCode
    166625
  • Title
    Scalable replay with partial-order dependencies for message-logging fault tolerance

  • Author
    Lifflander, Jonathan ; Meneses, Esteban ; Menon, Harshitha ; Miller, Paul ; Krishnamoorthy, Sriram ; Kale, Laxmikant V.

  • Author_Institution
    Dept. of Comput. Sci., Univ. of Illinois Urbana-Champaign, Urbana, IL, USA
  • fYear
    2014
  • fDate
    22-26 Sept. 2014
  • Firstpage
    19
  • Lastpage
    28
  • Abstract
    Deterministic replay of a parallel application is commonly used to discover bugs or to recover from a hard fault with message-logging fault tolerance. For message-passing programs, a major source of overhead during forward execution is recording the order in which messages are sent and received. During replay, this ordering must be used to deterministically reproduce the execution. Previous work on replay algorithms often makes minimal assumptions about the programming model and application in order to maintain generality. However, in many applications, only a partial order must be recorded, owing to determinism intrinsic to the program, ordering constraints imposed by the execution model, and events that are commutative (their relative execution order during replay does not need to be reproduced exactly). In this paper, we present a novel algebraic framework for reasoning about the minimum dependencies required to represent the partial order for different orderings and interleavings. By exploiting this framework, we improve on an existing scalable message-logging fault tolerance scheme that uses a total order. The improved scheme scales to 131,072 cores on an IBM BlueGene/P with up to 2× lower overhead.
  • Keywords
    fault tolerant computing; message passing; parallel processing; program debugging; IBM BlueGene/P; bugs discovery; message passing programs; message-logging fault tolerance; parallel application; partial-order dependencies; scalable replay; Benchmark testing; Debugging; Electronic mail; Fault tolerance; Fault tolerant systems; Program processors; Programming; determinism; execution model; message logging; replay
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cluster Computing (CLUSTER), 2014 IEEE International Conference on
  • Conference_Location
    Madrid
  • Type
    conf
  • DOI
    10.1109/CLUSTER.2014.6968739
  • Filename
    6968739