DocumentCode :
2593138
Title :
A fault tolerant approach in cluster computing system
Author :
Shwe, Thanda ; Aye, Win
Author_Institution :
Dept. of Inf. Technol., Mandalay Technol. Univ., Mandalay
Volume :
1
fYear :
2008
fDate :
14-17 May 2008
Firstpage :
149
Lastpage :
152
Abstract :
A long-term trend in high performance computing is the increasing number of nodes in parallel computing platforms, which entails a higher failure probability. Hence, fault tolerance becomes a key property for parallel applications running on parallel computing systems. The message passing interface (MPI) is currently the programming paradigm and communication library most commonly used on parallel computing platforms. MPI applications may be stopped at any time during their execution by an unpredictable failure. To avoid a complete restart of an MPI application because of a single failure, a fault tolerant MPI implementation is essential. In this paper, we propose a fault tolerant approach for cluster computing systems. Our approach is based on reassigning the tasks of a failed node to the remaining nodes of the system, with message logging used to recover lost messages. The system consists of two main parts: failure diagnosis and failure recovery. Failure diagnosis is the detection of a failure, and failure recovery is the action needed to take over the workload of a failed component. This fault tolerant approach is implemented as an extension of the message passing interface.
Keywords :
fault tolerant computing; message passing; parallel processing; probability; program diagnostics; software libraries; system recovery; workstation clusters; cluster computing system; communication library; failure diagnosis; failure probability; failure recovery; fault tolerant approach; high performance computing; message logging; message passing interface; parallel computing platforms; Application software; Clustering algorithms; Concurrent computing; Fault tolerance; Fault tolerant systems; Hardware; High performance computing; Message passing; Parallel processing; Parallel programming;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology, 2008. ECTI-CON 2008. 5th International Conference on
Conference_Location :
Krabi
Print_ISBN :
978-1-4244-2101-5
Electronic_ISBN :
978-1-4244-2102-2
Type :
conf
DOI :
10.1109/ECTICON.2008.4600394
Filename :
4600394