• DocumentCode
    3021831
  • Title
    A Programming Language Approach to Fault Tolerance for Fork-Join Parallelism
  • Author
    Zengin, Mustafa; Vafeiadis, Viktor
  • fYear
    2013
  • fDate
    1-3 July 2013
  • Firstpage
    105
  • Lastpage
    112
  • Abstract
    When running big parallel computations on thousands of processors, the probability that an individual processor will fail during the execution cannot be ignored. Computations should be replicated, or else failures should be detected at runtime and failed subcomputations reexecuted. We follow the latter approach and propose a high-level operational semantics that detects computation failures, and allows failed computations to be restarted from the point of failure. We implement this high-level semantics with a lower-level operational semantics that provides a more accurate account of processor failures, and prove in Coq the correspondence between the high- and low-level semantics.
  • Keywords
    checkpointing; fault tolerant computing; parallel processing; programming language semantics; Coq; computation failure detection; fault tolerance; fork-join parallelism; high-level operational semantics; lower-level operational semantics; parallel computations; processor failures; programming language; Computational modeling; Context; Program processors; Semantics; Standards
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    2013 International Symposium on Theoretical Aspects of Software Engineering (TASE)
  • Conference_Location
    Birmingham
  • Type
    conf
  • DOI
    10.1109/TASE.2013.22
  • Filename
    6597884