• DocumentCode
    560155
  • Title

    High-performance lattice QCD for multi-core based parallel systems using a cache-friendly hybrid threaded-MPI approach

  • Author

    Smelyanskiy, Mikhail ; Vaidyanathan, Karthikeyan ; Choi, Jee ; Joó, Bálint ; Chhugani, Jatin ; Clark, Michael A. ; Dubey, Pradeep

  • fYear
    2011
  • fDate
    12-18 Nov. 2011
  • Firstpage
    1
  • Lastpage
    10
  • Abstract
    Lattice Quantum Chromo-dynamics (LQCD) is a computationally challenging problem that solves the discretized Dirac equation in the presence of an SU(3) gauge field. Its key operation is a matrix-vector product, known as the Dslash operator. We have developed a novel multicore architecture-friendly implementation of the Wilson-Dslash operator which delivers 75 Gflops (single-precision) on an Intel® Xeon® Processor X5680, achieving 60% computational efficiency for datasets that fit in the last-level cache. For datasets larger than the last-level cache, this performance drops to 50 Gflops. Our performance is 2-3X higher than a well-known implementation from the Chroma software suite when running on the same hardware platform. The novel implementation of LQCD reported in this paper is based on the recently published 3.5D spatial and 4.5D temporal tiling schemes. Both blocking schemes significantly reduce LQCD external memory bandwidth requirements, delivering a more compute-bound implementation. The performance advantage of our schemes will become more significant as the gap between compute flops and external memory bandwidth continues to grow. We demonstrate very good cluster-level scalability of our implementation: for a lattice of 32^3 x 256 sites, we achieve over 4 Tflops when strong-scaled to a 128-node system (1536 cores total). For the same lattice size, a full Conjugate Gradients solver using the Wilson-Dslash operator achieves 2.95 Tflops.
  • Keywords
    Dirac equation; application program interfaces; cache storage; computer architecture; conjugate gradient methods; matrix algebra; message passing; multiprocessing systems; parallel processing; quantum computing; vectors; 3.5D spatial tiling scheme; 4.5D temporal tiling scheme; Chroma software suite; Dslash operator; Intel Xeon Processor X5680; LQCD external memory bandwidth requirement reduction; SU(3) gauge field; blocking schemes; cache-friendly hybrid threaded-MPI; cluster-level scalability; compute flops; conjugate gradients Wilson-Dslash operator; discretized Dirac equation; high-performance lattice QCD; lattice quantum chromo-dynamics; matrix-vector product; multicore architecture-friendly implementation; multicore based parallel systems; Bandwidth; Kernel; Lattices; Memory management; Multicore processing; Sockets; Vectors;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    High Performance Computing, Networking, Storage and Analysis (SC), 2011 International Conference for
  • Conference_Location
    Seattle, WA
  • Electronic_ISBN
    978-1-4503-0771-0
  • Type

    conf

  • Filename
    6114421