Analyzing the Effects of Multicore Architectures and On-Host Communication Characteristics on Collective Communications

Author

Ladd, Joshua ; Venkata, Manjunath Gorentla ; Graham, Richard ; Shamis, Pavel

Author_Institution

Oak Ridge Nat. Lab., Oak Ridge, TN, USA

fYear

2011

fDate

13-16 Sept. 2011

Firstpage

406

Lastpage

415

Abstract

Shared-memory optimizations for blocking collective communications on multi-core and distributed systems have previously been shown to improve the performance of these operations. However, such studies have tended to neglect the architecture of the multi-core node and the characteristics of shared-memory communication. In this paper, we examine in detail the impact of the on-node memory and cache hierarchy, and the optimization opportunities it provides, on the performance of the barrier and broadcast collective operations. The primary contribution of this paper is a demonstration of how exploiting the local memory hierarchy affects the performance of these operations in a distributed-system context. Our results show that factors such as the location of the communicating processes within the node, the number of communicating processes, the amount of shared-memory communication, and the amount of inter-socket (Central Processing Unit (CPU) socket) communication affect latency-sensitive and bandwidth-sensitive collective operations. The effect of these parameters varies with the type of operation and is coupled to the architecture of the shared-memory node and the scale of the collective operation. For 3,072 processes on Jaguar, considering the socket layout in the collective communication algorithm improves large-data MPI_Bcast() performance by 50% and MPI_Barrier() performance by 40% compared to neglecting this architectural feature. For a 512-process job on Smoky, the corresponding improvements are 38% and an order of magnitude, respectively. Small-data broadcast performance is not noticeably affected on Jaguar when the shared-memory hierarchy is considered; on Smoky, the corresponding performance improvement is 3%.

Keywords

cache storage; computer architecture; distributed shared memory systems; message passing; multiprocessing systems; performance evaluation; Jaguar; Smoky; cache hierarchy; central processing unit socket; collective communications; distributed systems; inter-socket; large-data MPI Bcast; multicore architectures; on-host communication characteristics; on-node memory; shared memory optimizations; shared-memory communication characteristics; Algorithm design and analysis; Multicore processing; Optimization; Scalability; Sockets; Topology; Collective Communications; MPI; Shared Memory

fLanguage

English

Publisher

IEEE

Conference_Titel

Parallel Processing Workshops (ICPPW), 2011 40th International Conference on

Conference_Location

Taipei City

ISSN

1530-2016

Print_ISBN

978-1-4577-1337-8

Electronic_ISBN

1530-2016

Type

conf

DOI

10.1109/ICPPW.2011.15

Filename

6047050