DocumentCode :
2126275
Title :
Fault-tolerance in filter-labeled-stream applications
Author :
Coutinho, Bruno ; Guedes, Dorgival ; Meira, Wagner, Jr. ; Ferreira, Renato A.
Author_Institution :
Univ. Fed. de Minas Gerais, Belo Horizonte
fYear :
2007
fDate :
24-27 Oct. 2007
Firstpage :
229
Lastpage :
236
Abstract :
Fault tolerance is a desirable feature in distributed high-performance systems, since applications tend to run for long periods of time and faults become more likely as the number of nodes in the system increases. However, most distributed environments lack any fault-tolerance features, since such features tend to be hard to implement and use, and often hurt performance dramatically. In this paper we discuss how we successfully added fault tolerance to the Anthill distributed programming environment by using an application-level checkpoint/rollback solution. The programming model offers an abstraction where the programmer can easily identify points during the execution where the communication pattern is well defined, forming a consistent cut where checkpoints may be saved without requiring extra communication, which avoids any domino effect during recovery from faults. We present the new abstractions for fault tolerance, describe how the solution was implemented, and present performance results that show the efficiency of the solution with both regular and irregular applications.
Keywords :
checkpointing; distributed programming; fault tolerant computing; programming environments; Anthill distributed programming environment; application-level checkpoint solution; application-level rollback solution; distributed high-performance systems; fault tolerance abstractions; filter labeled stream applications; Application software; Availability; Computer architecture; Computer science; Fault diagnosis; Fault tolerance; Fault tolerant systems; High performance computing; Programming environments; Programming profession;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computer Architecture and High Performance Computing, 2007. SBAC-PAD 2007. 19th International Symposium on
Conference_Location :
Rio Grande do Sul
ISSN :
1550-6533
Print_ISBN :
978-0-7695-3014-7
Type :
conf
DOI :
10.1109/SBAC-PAD.2007.31
Filename :
4384062