• DocumentCode
    2792525
  • Title
    Self Adaptive Application Level Fault Tolerance for Parallel and Distributed Computing
  • Author
    Chen, Zizhong ; Yang, Ming ; Francia, Guillermo ; Dongarra, Jack
  • Author_Institution
    MCIS Dept., Jacksonville State Univ., AL
  • fYear
    2007
  • fDate
    26-30 March 2007
  • Firstpage
    1
  • Lastpage
    8
  • Abstract
    Most application level fault tolerance schemes in the literature are non-adaptive in the sense that the fault tolerance schemes incorporated in applications are usually designed without incorporating information from system environments, such as the amount of available memory and the local or network I/O bandwidth. However, from an application point of view, it is often desirable for fault tolerant high performance applications to achieve high performance under whatever system environment they execute in, with as low a fault tolerance overhead as possible. In this paper, we demonstrate that, in order to achieve high reliability with as low a performance penalty as possible, fault tolerant schemes in applications need to be able to adapt themselves to different system environments. We propose a framework under which different fault tolerant schemes can be incorporated in applications using an adaptive method. Under this framework, applications are able to choose near optimal fault tolerance schemes at run time according to the specific characteristics of the platform on which the application is executing.
  • Keywords
    fault tolerant computing; multiprocessing systems; parallel processing; distributed computing; network I/O bandwidth; parallel computing; self adaptive application level fault tolerance; Application software; Bandwidth; Contracts; Distributed computing; Fault tolerance; Fault tolerant systems; Grid computing; High performance computing; Lifting equipment; Software libraries;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International
  • Conference_Location
    Long Beach, CA
  • Print_ISBN
    1-4244-0910-1
  • Electronic_ISBN
    1-4244-0910-1
  • Type
    conf
  • DOI
    10.1109/IPDPS.2007.370604
  • Filename
    4228332