DocumentCode :
1854433
Title :
Analyzing the Effects of Multicore Architectures and On-Host Communication Characteristics on Collective Communications
Author :
Ladd, Joshua ; Venkata, Manjunath Gorentla ; Graham, Richard ; Shamis, Pavel
Author_Institution :
Oak Ridge Nat. Lab., Oak Ridge, TN, USA
fYear :
2011
fDate :
13-16 Sept. 2011
Firstpage :
406
Lastpage :
415
Abstract :
Shared-memory optimizations for blocking collective communications on multi-core and distributed systems have previously been shown to improve the performance of these operations. Such studies have tended to neglect the architecture of the multi-core node and its shared-memory communication characteristics. In this paper, we examine in detail the impact of the on-node memory and cache hierarchy, and the optimization opportunities these provide, on the performance of the barrier and broadcast collective operations. The primary contribution of this paper is the demonstration of how exploiting the local memory hierarchy impacts the performance of these operations in the distributed system context. Our results show that factors such as the location of the communicating processes within the node, the number of communicating processes, the amount of shared-memory communication, and the amount of inter-socket (Central Processing Unit (CPU) socket) communication affect latency-sensitive and bandwidth-sensitive collective operations. The effect of these parameters varies with the type of operation and is coupled to the architecture of the shared-memory node and the scale of the collective operation. For 3,072 processes on Jaguar, considering the socket layout in the collective communication algorithm improves large-data MPI_Bcast() performance by 50% and MPI_Barrier() performance by 40% compared to neglecting this architectural feature. For a 512-process job on Smoky, the corresponding improvements are 38% and an order of magnitude, respectively. Small-data broadcast performance is not noticeably affected on Jaguar when the shared-memory hierarchy is considered; on Smoky the corresponding improvement is 3%.
Keywords :
cache storage; computer architecture; distributed shared memory systems; message passing; multiprocessing systems; performance evaluation; Jaguar; Smoky; cache hierarchy; central processing unit socket; collective communications; distributed systems; inter-socket; large-data MPI Bcast; multicore architectures; on-host communication characteristics; on-node memory; shared memory optimizations; shared-memory communication characteristics; Algorithm design and analysis; Multicore processing; Optimization; Scalability; Sockets; Topology; Collective Communications; MPI; Shared Memory;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Parallel Processing Workshops (ICPPW), 2011 40th International Conference on
Conference_Location :
Taipei City
ISSN :
1530-2016
Print_ISBN :
978-1-4577-1337-8
Electronic_ISBN :
1530-2016
Type :
conf
DOI :
10.1109/ICPPW.2011.15
Filename :
6047050