Regional Information Center for Science and Technology - Athanasia: A User-Transparent and Fault-Tolerant System for Parallel Applications

DocumentCode :

1446851

Title :

Athanasia: A User-Transparent and Fault-Tolerant System for Parallel Applications

Author :

Jung, Hyungsoo ; Han, Hyuck ; Yeom, Heon Y. ; Kang, Sooyong

Author_Institution :

Sch. of Inf. Technol., Univ. of Sydney, Sydney, NSW, Australia

Volume :

Issue :

Year :

2011

Firstpage :

1653

Lastpage :

1668

Abstract :

This article presents Athanasia, a user-transparent, fault-tolerant system for parallel applications running on large-scale cluster systems. Cluster systems have become a de facto standard for achieving multi-teraflop computing power, but they have an inherent failure factor that can cause computations to fail. Reliability in parallel computing systems has therefore been studied for a relatively long time, and extensive research has produced many theoretical promises. Despite these rigorous studies, however, practical and easily deployable fault-tolerant systems have not been successfully adopted commercially. Athanasia is a user-transparent checkpointing system for a fault-tolerant Message Passing Interface (MPI) implementation that is primarily based on the sync-and-stop protocol. It supports three functionalities that are critical for fault tolerance: a lightweight failure detection mechanism, dynamic process management that includes process migration, and a consistent checkpoint and recovery mechanism. Its main features are that it requires no modifications to the application code and that it preserves many of the high-performance characteristics of high-speed networks. Experimental results show that Athanasia can be a good candidate for a practically deployable fault-tolerant system in very large, high-performance clusters and that its protocol can be applied easily to a variety of parallel communication libraries.

Keywords :

checkpointing; message passing; parallel processing; software fault tolerance; Athanasia; checkpoint; dynamic process management; fault tolerant system; large scale cluster systems; light weight failure detection mechanism; message passing interface; multiteraflop computing power; parallel applications; parallel communication libraries; process migration; recovery mechanism; reliability issue; user transparent system; Checkpointing; Communication channels; Fault tolerance; Fault tolerant systems; Lead; Process control; Protocols; InfiniBand; Myrinet; User transparency; ch_p4; fault tolerance; message passing interface; parallel systems

Language :

English

Journal_Title :

IEEE Transactions on Parallel and Distributed Systems

Publisher :

IEEE

ISSN :

1045-9219

Type :

jour

DOI :

10.1109/TPDS.2011.63

Filename :

5710900

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1446851