• DocumentCode
    255960
  • Title

    Optimizing MPI collectives on Intel MIC through effective use of cache

  • Author

    Panigrahi, P. ; Kanchiraju, S. ; Srinivasan, A. ; Baruah, P.K. ; Sudheer, C.D.

  • Author_Institution
    Dept. of Math. & Comput. Sci., Sri Sathya Sai Inst. of Higher Learning, Prashanthi Nilayam, India
  • fYear
    2014
  • fDate
    11-13 Dec. 2014
  • Firstpage
    88
  • Lastpage
    93
  • Abstract
    The Intel MIC architecture, implemented in the Xeon Phi coprocessor, is targeted at highly parallel applications. To exploit it, one needs to make full use of simultaneous multi-threading, which permits four simultaneous threads per core. Our results also show that distributed tag directories can be a greater bottleneck than the ring for small messages when multiple threads access the same cache line. Careful design of algorithms and implementations based on these results can yield substantial performance improvement. We demonstrate these ideas by optimizing MPI collective calls. We obtain a speedup of 9x on barrier and a speedup of 10x on broadcast, when compared with Intel's MPI implementation. We also show the usefulness of our collectives in two realistic codes: particle transport and the load balancing phase in QMC. Another important contribution of our work lies in showing that optimization techniques used with programmer-controlled caches, such as double buffering, are also useful on MIC. These results can help optimize other communication-intensive codes running on MIC.
  • Keywords
    cache storage; coprocessors; message passing; multi-threading; resource allocation; shared memory systems; Intel MIC architecture; Intel MPI implementation; MPI collective call; MPI collectives; QMC; Xeon Phi coprocessor; cache line; communication intensive code; distributed tag directory; double buffering; load balancing phase; multithreading; optimization technique; parallel application; particle transport; performance improvement; programmer controlled cache; realistic code; Algorithm design and analysis; Bandwidth; Benchmark testing; Computer architecture; Grid computing; Message systems; Microwave integrated circuits; Intel Xeon Phi; MIC; MPI; barrier; broadcast; double buffering; shared memory;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel, Distributed and Grid Computing (PDGC), 2014 International Conference on
  • Conference_Location
    Solan
  • Print_ISBN
    978-1-4799-7682-9
  • Type

    conf

  • DOI
    10.1109/PDGC.2014.7030721
  • Filename
    7030721