• DocumentCode
    822110
  • Title

    FT²EI: A Dynamic Fault-Tolerant Routing Methodology for Fat Trees with Exclusion Intervals

  • Author

    Requena, Crispín Gómez ; Requena, María Engracia Gómez ; Rodríguez, Pedro Juan López ; Marín, José Francisco Duato

  • Author_Institution
    Dept. of Comput. Eng. (DISCA), Univ. Politec. de Valencia, Valencia
  • Volume
    20
  • Issue
    6
  • fYear
    2009
  • fDate
    6/1/2009
  • Firstpage
    802
  • Lastpage
    817
  • Abstract
    Fault tolerance in the interconnection network of large clusters of PCs is an issue of growing importance, since their increasing size also increases the failure probability. The fat-tree topology is commonly used in these machines, as it has become very popular among high-speed interconnect manufacturers. This paper proposes a new distributed fault-tolerant routing methodology for fat trees. Unlike previous proposals, it does not require additional network hardware, and its memory requirements, switch hardware, and routing delay scale with the network size. Indeed, it nullifies only the strictly necessary paths, allowing adaptive routing through the healthy paths. The methodology is based on enhancing the interval routing scheme with exclusion intervals. Exclusion intervals are associated with each switch output port and represent the nodes that are unreachable from this port after a fault. We propose a methodology to identify the links where the exclusion intervals must be updated after a fault, the values to write to them, and a very efficient mechanism to distribute the required information through the network without stopping system activity. Our methodology can tolerate a high number of network failures with low performance degradation. Moreover, it can achieve zero packet loss during the update period.
  • Keywords
    fault tolerance; multiprocessor interconnection networks; network routing; network topology; probability; trees (mathematics); dynamic distributed fault-tolerant routing methodology; exclusion interval; failure probability; fat-tree topology; interconnection network; interval routing scheme; large PC cluster; zero packet losing; adaptive routing; dynamic fault model; fat trees; memory-effective routing; Network Architecture and Design
  • fLanguage
    English
  • Journal_Title
    Parallel and Distributed Systems, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1045-9219
  • Type

    jour

  • DOI
    10.1109/TPDS.2008.130
  • Filename
    4585369