Title :
Automatic Path Migration over InfiniBand: Early Experiences
Author :
Vishnu, Abhinav ; Mamidala, Amith R. ; Narravula, Sundeep ; Panda, Dhabaleswar K.
Author_Institution :
Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH
Abstract :
The high computational power of commodity PCs, combined with the emergence of low-latency, high-bandwidth interconnects, has accelerated the trend toward cluster computing. Clusters with InfiniBand are being deployed, as reflected in the TOP500 supercomputer rankings. However, the increasing scale of these clusters has reduced the mean time between failures (MTBF) of their components. The network is one such component: the failure of network interface cards (NICs), cables, or switches breaks existing communication paths. InfiniBand provides a hardware mechanism, automatic path migration (APM), which allows user-transparent detection of and recovery from network faults without an application restart. In this paper, we design a set of modules that work together to provide network fault tolerance for user-level applications by leveraging the APM feature. Our performance evaluation at the MPI layer shows that APM incurs negligible overhead in the absence of faults in the system. In the presence of network faults, APM incurs negligible overhead for reasonably long-running applications.
Keywords :
fault tolerant computing; message passing; multiprocessor interconnection networks; performance evaluation; workstation clusters; InfiniBand; MPI layer; automatic path migration; bandwidth interconnects; cluster computing; commodity PC; network fault tolerance; Bandwidth; Communication cables; Communication switching; Delay; Fault detection; Hardware; Network interfaces; Personal communication networks; Supercomputers; Switches; APM; MPI; MTBF; Verbs
Conference_Title :
Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International
Conference_Location :
Long Beach, CA
Print_ISBN :
1-4244-0910-1
Electronic_ISBN :
1-4244-0910-1
DOI :
10.1109/IPDPS.2007.370626