DocumentCode
3324331
Title
Starfish: fault-tolerant dynamic MPI programs on clusters of workstations
Author
Agbaria, Adnan M. ; Friedman, Roy
Author_Institution
Dept. of Comput. Sci., Technion-Israel Inst. of Technol., Haifa, Israel
fYear
1999
fDate
1999
Firstpage
167
Lastpage
176
Abstract
This paper reports on the architecture and design of Starfish, an environment for executing dynamic (and static) MPI-2 programs on a cluster of workstations. Starfish is unique in being efficient, fault-tolerant, highly available, and dynamic as a system internally, and in supporting fault-tolerance and dynamicity for its application programs as well. Starfish achieves these goals by combining group communication technology with checkpoint/restart, and uses a novel architecture that is both flexible and portable and keeps group communication outside the critical data path, for maximum performance.
Keywords
message passing; software architecture; software fault tolerance; software portability; system recovery; workstation clusters; Starfish; application programs; checkpoint; critical data path; dynamic MPI programs; fault-tolerant programs; group communication technology; maximum performance; restart; Bandwidth; Communications technology; Computer architecture; Computer networks; Computer science; Concurrent computing; Fault tolerance; Operating systems; Portable computers; Workstations
fLanguage
English
Publisher
IEEE
Conference_Titel
High Performance Distributed Computing, 1999. Proceedings. The Eighth International Symposium on
Conference_Location
Redondo Beach, CA
ISSN
1082-8907
Print_ISBN
0-7803-5681-0
Type
conf
DOI
10.1109/HPDC.1999.805295
Filename
805295