Title :
Fault tolerant computing on the grid: what are my options?
Author :
Weissman, Jon B.
Author_Institution :
Div. of Comput. Sci., Texas Univ., San Antonio, TX, USA
Abstract :
High-performance distributed computing across wide-area networks has become an active topic of research. Achieving large-scale distributed computing in a seamless manner introduces a number of difficult problems. This paper examines one of the most critical problems, fault tolerance. We have examined fault tolerance options for a common class of high-performance parallel applications, single-program-multiple-data (SPMD). Performance models for two fault tolerance methods, checkpoint-recovery (CR) and wide-area replication (WR), have been developed. These models enable quantitative comparisons of the two methods as applied to SPMD applications.
Keywords :
fault tolerant computing; wide area networks; checkpoint-recovery; distributed computing; fault tolerance; parallel applications; single-program-multiple-data; wide-area networks; wide-area replication; Checkpointing; Chromium; Computer science; Cost function; Distributed computing; Fault tolerance; File servers; Grid computing; Large-scale systems; Testing;
Conference_Title :
High Performance Distributed Computing, 1999. Proceedings. The Eighth International Symposium on
Conference_Location :
Redondo Beach, CA
Print_ISBN :
0-7803-5681-0
DOI :
10.1109/HPDC.1999.805323