Title :
Immunet: a cheap and robust fault-tolerant packet routing mechanism
Author :
Puente, V. ; Gregorio, J.A. ; Vallejo, F. ; Beivide, R.
Author_Institution :
Comput. Archit. Res. Group, Cantabria Univ., Santander, Spain
Abstract :
This work presents Immunet, a new and efficient mechanism for tolerating failures in interconnection networks for parallel and distributed computers. In the presence of failures, Immunet automatically reacts with a hardware reconfiguration of the surviving network resources. Immunet has four important advantages over previous fault-tolerant switching mechanisms. First, its low hardware cost minimizes the overhead that the network must support in the absence of faults. Second, as long as the network remains connected, Immunet can tolerate any number of failures regardless of their spatial and temporal combinations. Third, the resulting communication infrastructure provides optimized adaptive minimal routing over the surviving topology. Fourth, the system behavior under successive failures exhibits graceful performance degradation. Immunet reconfiguration can be totally transparent to the applications running on the parallel system, as they are affected only by the loss of those data packets circulating through the broken components. The remaining packets suffer only a tolerable delay, induced by the time needed to perform the automatic network reconfiguration. Descriptions of the hardware network architecture, together with detailed synthetic and execution-driven simulations, demonstrate the benefits of Immunet.
Keywords :
fault tolerance; multiprocessor interconnection networks; network routing; network topology; packet switching; parallel architectures; reconfigurable architectures; Immunet reconfiguration; automatic network reconfiguration; communication infrastructure; data packet circulation; distributed computers; fault-tolerant packet routing mechanism; fault-tolerant switching mechanisms; hardware cost minimization; hardware network architecture; hardware reconfiguration; interconnection networks; network overhead; network resources; optimized adaptive minimal routing; parallel computers; parallel system; system behavior; Communication switching; Computer networks; Concurrent computing; Costs; Distributed computing; Fault tolerance; Hardware; Multiprocessor interconnection networks; Robustness; Routing
Conference_Title :
Computer Architecture, 2004. Proceedings. 31st Annual International Symposium on
Print_ISBN :
0-7695-2143-6
DOI :
10.1109/ISCA.2004.1310775