• DocumentCode
    3333724
  • Title

    Software schemes of reconfiguration and recovery in distributed memory multicomputers using the actor model

  • Author

    Peercy, M. ; Banerjee, P.

  • Author_Institution
    Center for Reliable & High Performance Comput., Illinois Univ., Urbana, IL, USA
  • fYear
    1995
  • fDate
    27-30 June 1995
  • Firstpage
    479
  • Lastpage
    488
  • Abstract
    Ideally, a multicomputer system should cope with a processor failure by reconstructing itself, and the application running on itself, in order to maintain the available computational power of the remaining processors. We discuss the continuance of running applications through permanent processor failures. We take advantage of the characteristics of the actor model of parallel computation and dynamically checkpoint the activity of the application. Consequently, the runtime system is able to continue an application through multiple nonconcurrent processor failures. We have implemented our techniques through modifications of the runtime system of the parallel language Charm on an Intel iPSC/s hypercube. After discussing the theory and implementation, we give measurements of overhead due to fault tolerance for a number of applications and demonstrate continuance of the applications after injection of one or more faults.
  • Keywords
    distributed memory systems; fault tolerant computing; hypercube networks; parallel languages; parallel processing; reconfigurable architectures; reliability; system recovery; Charm parallel language; Intel iPSC/s hypercube; actor model; applications running; computational power; distributed memory multicomputers; dynamic activity checkpointing; fault injection; fault tolerance; multicomputer system; multiple nonconcurrent processor failure; overhead; parallel computation; permanent processor failures; processor failure; reconfiguration; recovery; runtime system; software schemes; Checkpointing; Computational modeling; Concurrent computing; Distributed computing; Object oriented modeling; Parallel languages; Peer to peer computing; Power system modeling; Power system reliability; Software maintenance;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Fault-Tolerant Computing, 1995. FTCS-25. Digest of Papers., Twenty-Fifth International Symposium on
  • Conference_Location
    Pasadena, CA, USA
  • Print_ISBN
    0-8186-7079-7
  • Type

    conf

  • DOI
    10.1109/FTCS.1995.466950
  • Filename
    466950