Design and Implementation of a Highly Efficient DGEMM for 64-Bit ARMv8 Multi-core Processors

Author

Feng Wang;Hao Jiang;Ke Zuo;Xing Su;Jingling Xue;Canqun Yang

Author_Institution

School of Computer Science, National University of Defense Technology, Changsha, China

fYear

2015

Firstpage

200

Lastpage

209

Abstract

This paper presents the design and implementation of a highly efficient double-precision general matrix multiplication (DGEMM) routine, based on OpenBLAS, for 64-bit ARMv8 eight-core processors. We adopt a theory-guided approach: we first develop a performance model for this architecture and then use it to guide our exploration. The key enabler is a highly optimized inner kernel, GEBP, written in assembly language. We obtain GEBP by (1) maximizing its compute-to-memory-access ratios across all levels of the ARMv8 memory hierarchy, with its performance-critical block sizes determined analytically, and (2) optimizing its computations through loop unrolling, instruction scheduling and software-implemented register rotation, and by using A64 instructions that support efficient FMA operations, data transfers and prefetching. We compare our DGEMM, implemented in OpenBLAS, with the one in ATLAS, which also relies on a highly optimized GEBP in assembly. Our implementation outperforms the ATLAS one, raising the peak performance (efficiency) of DGEMM from 3.88 Gflops (80.9%) to 4.19 Gflops (87.2%) on one core and from 30.4 Gflops (79.2%) to 32.7 Gflops (85.3%) on eight cores. These gains correspond to relative efficiency improvements of 7.79% on one core and 7.70% on eight cores. In addition, the one-core efficiency of our implementation is very close to the theoretical upper bound of 91.5% obtained from micro-benchmarking. Our parallel implementation achieves good performance and scalability under varying thread counts across the range of matrix sizes evaluated.
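To make the role of the GEBP inner kernel concrete, the following is a minimal Python sketch of the loop structure commonly used in Goto-style blocked matrix multiplication, where a block of A is multiplied by a panel of B and the result is accumulated in small register-sized tiles. It is an illustration under stated assumptions, not the paper's assembly kernel: the function names (`gebp`, `dgemm_blocked`, `dgemm_naive`) and the block sizes `mc`, `kc`, `mr`, `nr` are hypothetical placeholders, no packing or prefetching is modeled, and the abstract does not give the analytically derived block sizes.

```python
def gebp(A, B, C, ic, pc, mc, kc, mr=4, nr=4):
    """C[ic:ic+mc, :] += A[ic:ic+mc, pc:pc+kc] * B[pc:pc+kc, :].

    mr x nr is the (hypothetical) register tile; the real kernel keeps
    this tile in SIMD registers and is written in assembly.
    """
    n = len(B[0])
    i_end = min(ic + mc, len(A))
    p_end = min(pc + kc, len(B))
    for jr in range(0, n, nr):
        j_end = min(jr + nr, n)
        for ir in range(ic, i_end, mr):
            r_end = min(ir + mr, i_end)
            # Accumulators for one micro-tile of C, kept local so that
            # C itself is read and written once per tile per kc-step.
            acc = [[0.0] * (j_end - jr) for _ in range(r_end - ir)]
            for p in range(pc, p_end):
                b_row = B[p]
                for i in range(ir, r_end):
                    a = A[i][p]
                    acc_row = acc[i - ir]
                    for j in range(jr, j_end):
                        acc_row[j - jr] += a * b_row[j]  # one multiply-add
            for i in range(ir, r_end):
                for j in range(jr, j_end):
                    C[i][j] += acc[i - ir][j - jr]


def dgemm_blocked(A, B, C, mc=8, kc=8):
    """C += A * B, tiled into (mc x kc) blocks of A, each fed to gebp."""
    m, k = len(A), len(B)
    for pc in range(0, k, kc):        # partition the shared dimension
        for ic in range(0, m, mc):    # partition the rows of A and C
            gebp(A, B, C, ic, pc, mc, kc)
    return C


def dgemm_naive(A, B):
    """Reference triple loop, used only to check the blocked version."""
    m, k, n = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]
```

In this sketch, larger tiles raise the number of multiply-adds performed per element loaded, which is the compute-to-memory-access ratio the abstract refers to. Calling `dgemm_blocked(A, B, C, mc=2, kc=3)` on small list-of-lists matrices and comparing against `dgemm_naive(A, B)` is a quick way to confirm that the tiling handles edge blocks correctly.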

Keywords

"Registers","Kernel","Computational modeling","Program processors","Assembly","Memory management"

Publisher

IEEE

Conference_Title

2015 44th International Conference on Parallel Processing (ICPP)

ISSN

0190-3918

Type

conf

DOI

10.1109/ICPP.2015.29

Filename

7349575