DocumentCode :
1783291
Title :
High Performance Alltoall and Allgather Designs for InfiniBand MIC Clusters
Author :
Venkatesh, Akshay ; Potluri, Sreeram ; Rajachandrasekar, Raghunath ; Luo, Miao ; Hamidouche, Khaled ; Panda, Dhabaleswar K.
Author_Institution :
Network-Based Comput. Lab., Ohio State Univ., Columbus, OH, USA
fYear :
2014
fDate :
19-23 May 2014
Firstpage :
637
Lastpage :
646
Abstract :
Intel's Many-Integrated-Core (MIC) architecture aims to provide Teraflop throughput (through high degrees of parallelism) with a high FLOP/Watt ratio and x86 compatibility. However, this two-fold approach to solving power and programmability challenges for Exascale computing is constrained by certain architectural idiosyncrasies. MIC coprocessors have a memory-constrained environment and their processors operate at slower clock rates. Also, being PCI devices, MIC coprocessors have communication characteristics that differ from those seen in homogeneous environments. For instance, sending data from the MIC memory to a remote node's memory through message passing routines has 3x-6x higher latency than sending from the host processor memory. Hence, communication libraries that do not consider these architectural subtleties are likely to nullify performance benefits or even cause degradation in applications that intend to use MICs and rely heavily on communication routines. The performance of Message Passing Interface (MPI) operations, especially dense collective operations like All-to-all and Allgather, strongly affects the performance of many distributed parallel applications. In this paper, we revisit state-of-the-art algorithms commonly used to implement All-to-all collectives and propose adaptations and optimizations to alleviate architectural bottlenecks on MIC clusters. We also propose a few novel designs to improve the communication latency of these operations. Through micro-benchmarks and applications, we substantiate the benefits of incorporating the proposed adaptations to the All-to-all collective operations. At the micro-benchmark level, the proposed designs show as much as 79% improvement for the Allgather operation and up to 70% improvement for All-to-all; with the P3DFFT application, an improvement of 38% is seen in overall execution time.
Keywords :
application program interfaces; coprocessors; message passing; parallel architectures; parallel programming; FLOP/Watt ratio; InfiniBand MIC Clusters; Intel MIC architecture; MIC coprocessors; MIC memory; MPI; P3DFFT application; PCI devices; all-to-all collective operations; architectural idiosyncrasies; communication characteristics; communication latency improvement; communication libraries; communication routines; distributed parallel applications; exascale computing; high performance allgather design; high performance alltoall design; many-integrated-core architecture; memory constrained environment; message passing interface; message passing routines; teraflop throughput; two-fold approach; x86 compatibility; Algorithm design and analysis; Clustering algorithms; Computer architecture; Coprocessors; Libraries; Microwave integrated circuits; Optimization; All-to-all; Allgather; Collectives; MPI; Many-Integrated-Core;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Parallel and Distributed Processing Symposium, 2014 IEEE 28th International
Conference_Location :
Phoenix, AZ
ISSN :
1530-2075
Print_ISBN :
978-1-4799-3799-8
Type :
conf
DOI :
10.1109/IPDPS.2014.72
Filename :
6877296