• DocumentCode
    2243546
  • Title

    Achieving Predictable High Performance in Imbalanced Fat Trees

  • Author

    Bogdanski, Bartosz ; Sem-Jacobsen, Frank Olaf ; Reinemo, Sven-Arne ; Skeie, Tor ; Holen, Line ; Huse, Lars Paul

  • Author_Institution
    Simula Res. Lab., Lysaker, Norway
  • fYear
    2010
  • fDate
    8-10 Dec. 2010
  • Firstpage
    381
  • Lastpage
    388
  • Abstract
    The fat-tree topology has become a popular choice for InfiniBand fabrics due to its inherent deadlock freedom, fault tolerance, and full bisection bandwidth. InfiniBand is used by more than 40% of the systems on the latest Top 500 list, and many of these systems are based on a fat-tree topology. However, the current InfiniBand fat-tree routing algorithm suffers from flaws that reduce its scalability and flexibility. Counter-intuitively, the achievable throughput per node deteriorates both when the number of nodes in a tree decreases and when the node distribution among leaves is nonuniform. In this paper, we identify the weaknesses of the current enhanced fat-tree routing algorithm in the Open Fabrics Enterprise Distribution and propose extensions to it that alleviate all performance problems related to node distribution. The new algorithm is implemented in OpenSM for real-world evaluation and for future contribution to the Open Fabrics community. We demonstrate that our solution achieves predictable high throughput regardless of the number of nodes and their distribution. Furthermore, simulations show that our extensions improve throughput by up to 30%, depending on topology size and node distribution.
  • Keywords
    telecommunication network routing; telecommunication network topology; workstation clusters; InfiniBand fabrics; OpenSM; fat-tree routing algorithm; fat-tree topology; fault-tolerance; full bisection bandwidth; imbalanced fat trees; inherent deadlock freedom; open fabrics enterprise distribution; predictable high performance; InfiniBand; fat-trees; interconnection networks; routing;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Systems (ICPADS), 2010 IEEE 16th International Conference on
  • Conference_Location
    Shanghai
  • ISSN
    1521-9097
  • Print_ISBN
    978-1-4244-9727-0
  • Electronic_ISBN
    1521-9097
  • Type

    conf

  • DOI
    10.1109/ICPADS.2010.94
  • Filename
    5695626