Title :
Scaling alltoall collective on multi-core systems
Author :
Kumar, Rahul ; Mamidala, Amith ; Panda, D.K.
Author_Institution :
Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH
Abstract :
MPI_Alltoall is one of the most communication-intensive collective operations used in many parallel applications. Recently, the supercomputing arena has witnessed phenomenal growth of commodity clusters built using InfiniBand and multi-core systems. In this context, it is important to optimize this operation for these emerging clusters to allow for good application scaling. However, optimizing MPI_Alltoall on these emerging systems is not a trivial task. The InfiniBand architecture allows for varying implementations of the network protocol stack. For example, the protocol can be entirely on-loaded to a host processing core, off-loaded onto the NIC, or split between the two in any combination. Understanding the characteristics of these different implementations is critical in optimizing a communication-intensive operation such as MPI_Alltoall. In this paper, we systematically study these different architectures and propose new schemes for MPI_Alltoall tailored to these architectures. Specifically, we demonstrate that no single common scheme performs optimally on all of these architectures. For example, on-loaded implementations can exploit multiple cores to achieve better network utilization, whereas with offload interfaces, aggregation can be used to avoid congestion on multi-core systems. We employ shared memory aggregation techniques in these schemes and elucidate their impact on multi-core systems. The proposed design reduces MPI_Alltoall time by 55% for 512-byte messages and speeds up the CPMD application by 33%.
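The aggregation idea mentioned in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the scheme proposed in the paper, whose details the abstract does not give. It assumes ranks are laid out node by node, that one hypothetical "leader" rank per node combines the blocks of its local ranks, and that leaders exchange one combined message per remote node. The function names (`direct_alltoall`, `aggregated_alltoall`, `direct_network_messages`) are invented for illustration, and the message counts come from the simulation, not from measurements.

```python
# Illustrative simulation only: no MPI is used, and the layout
# (consecutive ranks share a node) is an assumption.

def direct_alltoall(send):
    """Reference semantics: recv[j][i] is the block rank i sent to rank j."""
    p = len(send)
    return [[send[i][j] for i in range(p)] for j in range(p)]


def direct_network_messages(p, cores):
    """Inter-node messages when every rank sends to every remote rank."""
    return p * (p - cores)


def aggregated_alltoall(send, cores):
    """Leader-based variant: gather per node, exchange per node pair, scatter.

    Returns (recv, network_messages), where recv follows the same
    convention as direct_alltoall.
    """
    p = len(send)
    if p % cores != 0:
        raise ValueError("number of ranks must be a multiple of cores per node")
    nodes = p // cores

    # Step 1: intra-node gather (shared memory on a real system).
    # The leader of node a packs, per destination node b, every block
    # going from a local rank on a to a rank on b.
    packed = {}
    for a in range(nodes):
        for b in range(nodes):
            packed[(a, b)] = {
                (src, dst): send[src][dst]
                for src in range(a * cores, (a + 1) * cores)
                for dst in range(b * cores, (b + 1) * cores)
            }

    # Step 2: inter-node exchange. One combined message per ordered
    # pair of distinct nodes crosses the network.
    network_messages = sum(1 for (a, b) in packed if a != b)

    # Step 3: intra-node scatter. Each leader hands blocks to local ranks.
    recv = [[None] * p for _ in range(p)]
    for blocks in packed.values():
        for (src, dst), block in blocks.items():
            recv[dst][src] = block
    return recv, network_messages


if __name__ == "__main__":
    ranks, cores_per_node = 8, 4
    send = [[f"{i}->{j}" for j in range(ranks)] for i in range(ranks)]
    recv, msgs = aggregated_alltoall(send, cores_per_node)
    print("same result as direct:", recv == direct_alltoall(send))
    print("network messages, direct:", direct_network_messages(ranks, cores_per_node))
    print("network messages, aggregated:", msgs)
```

In this toy model, aggregation trades many small inter-node messages for fewer, larger ones, at the cost of extra intra-node copies. Whether that trade pays off on a given interconnect is the kind of question the paper studies.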
Keywords :
computer architecture; message passing; parallel processing; shared memory systems; InfiniBand architecture; MPI_Alltoall; alltoall collective scaling; application scaling; communication intense collective operation; multicore systems; network protocol stack; network utilization; offload interface aggregation; parallel applications; shared memory aggregation technique; supercomputing; Application software; Computer science; Context; Delay; Multicore processing; Network interfaces; Protocols; Sun; Surges; US Department of Energy;
Conference_Title :
Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on
Conference_Location :
Miami, FL
Print_ISBN :
978-1-4244-1693-6
ISSN :
1530-2075
DOI :
10.1109/IPDPS.2008.4536141