DocumentCode
3333724
Title
Software schemes of reconfiguration and recovery in distributed memory multicomputers using the actor model
Author
Peercy, M. ; Banerjee, P.
Author_Institution
Center for Reliable & High Performance Comput., Illinois Univ., Urbana, IL, USA
fYear
1995
fDate
27-30 June 1995
Firstpage
479
Lastpage
488
Abstract
Ideally, a multicomputer system should cope with a processor failure by reconstructing itself, and the application running on it, in order to maintain the available computational power of the remaining processors. We discuss the continuance of running applications through permanent processor failures. We take advantage of the characteristics of the actor model of parallel computation and dynamically checkpoint the activity of the application. Consequently, the runtime system is able to continue an application through multiple nonconcurrent processor failures. We have implemented our techniques through modifications of the runtime system of the parallel language Charm on an Intel iPSC/s hypercube. After discussing the theory and implementation, we give measurements of overhead due to fault tolerance for a number of applications and demonstrate continuance of the applications after injection of one or more faults.
Keywords
distributed memory systems; fault tolerant computing; hypercube networks; parallel languages; parallel processing; reconfigurable architectures; reliability; system recovery; Charm parallel language; Intel iPSC/s hypercube; actor model; applications running; computational power; distributed memory multicomputers; dynamic activity checkpointing; fault injection; fault tolerance; multicomputer system; multiple nonconcurrent processor failure; overhead; parallel computation; permanent processor failures; processor failure; reconfiguration; recovery; runtime system; software schemes; Checkpointing; Computational modeling; Concurrent computing; Distributed computing; Object oriented modeling; Parallel languages; Peer to peer computing; Power system modeling; Power system reliability; Software maintenance;
fLanguage
English
Publisher
IEEE
Conference_Titel
Fault-Tolerant Computing, 1995. FTCS-25. Digest of Papers., Twenty-Fifth International Symposium on
Conference_Location
Pasadena, CA, USA
Print_ISBN
0-8186-7079-7
Type
conf
DOI
10.1109/FTCS.1995.466950
Filename
466950