• DocumentCode
    3706494
  • Title

    Design and Implementation of a Highly Efficient DGEMM for 64-Bit ARMv8 Multi-core Processors

  • Author

    Feng Wang;Hao Jiang;Ke Zuo;Xing Su;Jingling Xue;Canqun Yang

  • Author_Institution
    Sch. of Comput. Sci., Nat. Univ. of Defense Technol., Changsha, China
  • fYear
    2015
  • Firstpage
    200
  • Lastpage
    209
  • Abstract
    This paper presents the design and implementation of a highly efficient Double-precision General Matrix Multiplication (DGEMM) based on Open BLAS for 64-bit ARMv8 eight-core processors. We adopt a theory-guided approach by first developing a performance model for this architecture and then using it to guide our exploration. The key enabler for a highly efficient DGEMM is a highly-optimized inner kernel GEBP developed in assembly language. We have obtained GEBP by (1) maximizing its compute-to-memory access ratios across all levels of the memory hierarchy in the ARMv8 architecture with its performance-critical block sizes being determined analytically, and (2) optimizing its computations through exploiting loop unrolling, instruction scheduling and software-implemented register rotation and taking advantage of A64 instructions to support efficient FMA operations, data transfers and prefetching. We have compared our DGEMM implemented in Open BLAS with another implemented in ATLAS (also in terms of a highly-optimized GEBP in assembly). Our implementation outperforms the one in ALTAS by improving the peak performance (efficiency) of DGEMM from 3.88 Gflops (80.9%) to 4.19 Gflops (87.2%) on one core and from 30.4 Gflops (79.2%) to 32.7 Gflops (85.3%) on eight cores. These results translate into substantial performance (efficiency) improvements by 7.79% on one core and 7.70% on eight cores. In addition, the efficiency of our implementation on one core is very close to the theoretical upper bound 91.5% obtained from micro-benchmarking. Our parallel implementation achieves good performance and scalability under varying thread counts across a range of matrix sizes evaluated.
  • Keywords
    "Registers","Kernel","Computational modeling","Program processors","Assembly","Memory management"
  • Publisher
    ieee
  • Conference_Titel
    Parallel Processing (ICPP), 2015 44th International Conference on
  • ISSN
    0190-3918
  • Type

    conf

  • DOI
    10.1109/ICPP.2015.29
  • Filename
    7349575