• DocumentCode
    1854433
  • Title
    Analyzing the Effects of Multicore Architectures and On-Host Communication Characteristics on Collective Communications
  • Author
    Ladd, Joshua ; Venkata, Manjunath Gorentla ; Graham, Richard ; Shamis, Pavel
  • Author_Institution
    Oak Ridge Nat. Lab., Oak Ridge, TN, USA
  • fYear
    2011
  • fDate
    13-16 Sept. 2011
  • Firstpage
    406
  • Lastpage
    415
  • Abstract
    Shared-memory optimizations for blocking collective communications on multicore and distributed systems have previously been shown to improve the performance of these operations. Such studies, however, have tended to neglect the architecture of the multicore node and its shared-memory communication characteristics. In this paper, we examine in detail the impact of the on-node memory and cache hierarchy, and the optimization opportunities these provide, on the performance of the barrier and broadcast collective operations. The primary contribution of this paper is a demonstration of how exploiting the local memory hierarchy impacts the performance of these operations in the distributed-system context. Our results show that factors such as the location of the communicating processes within the node, the number of communicating processes, the amount of shared-memory communication, and the amount of inter-socket (Central Processing Unit (CPU) socket) communication affect latency-sensitive and bandwidth-sensitive collective operations. The effect of these parameters varies with the type of operation and is coupled to the architecture of the shared-memory node and the scale of the collective operation. For 3,072 processes on Jaguar, considering the socket layout in the collective communication algorithm improves large-data MPI_Bcast() performance by 50% and MPI_Barrier() performance by 40% compared with neglecting this architectural feature. For a 512-process job on Smoky, the corresponding improvements are 38% and an order of magnitude, respectively. Small-data broadcast performance is not noticeably affected on Jaguar when the shared-memory hierarchy is considered; on Smoky, the corresponding improvement is 3%.
  • Keywords
    cache storage; computer architecture; distributed shared memory systems; message passing; multiprocessing systems; performance evaluation; Jaguar; Smoky; cache hierarchy; central processing unit socket; collective communications; distributed systems; inter-socket; large-data MPI Bcast; multicore architectures; on-host communication characteristics; on-node memory; shared memory optimizations; shared-memory communication characteristics; Algorithm design and analysis; Multicore processing; Optimization; Scalability; Sockets; Topology; Collective Communications; MPI; Shared Memory;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel Processing Workshops (ICPPW), 2011 40th International Conference on
  • Conference_Location
    Taipei City
  • ISSN
    1530-2016
  • Print_ISBN
    978-1-4577-1337-8
  • Electronic_ISBN
    1530-2016
  • Type
    conf
  • DOI
    10.1109/ICPPW.2011.15
  • Filename
    6047050