DocumentCode :
2980723
Title :
Model-driven Level 3 BLAS Performance Optimization on Loongson 3A Processor
Author :
Zhang Xianyi ; Wang Qian ; Zhang Yunquan
Author_Institution :
Lab. of Parallel Software & Comput. Sci., Inst. of Software, Beijing, China
fYear :
2012
fDate :
17-19 Dec. 2012
Firstpage :
684
Lastpage :
691
Abstract :
Every mainstream processor vendor provides an optimized BLAS implementation for its CPU, as BLAS is a fundamental math library in scientific computing. The Loongson 3A CPU is a general-purpose 64-bit MIPS64 quad-core processor, developed by the Institute of Computing Technology, Chinese Academy of Sciences. To date, there has not been a sufficiently optimized BLAS on the Loongson 3A CPU. The purpose of this research is to optimize level 3 BLAS performance on the Loongson 3A CPU. We analyzed the Loongson 3A architecture and built a performance model that highlights L1 data cache misses as the key factor, which differs from level 3 BLAS optimization on mainstream x86 CPUs. We therefore employed a variety of methods to avoid L1 cache misses in single-thread optimization, including cache and register blocking, the Loongson 3A 128-bit memory access extension instructions, software prefetching, and single-precision floating-point SIMD instructions. Furthermore, we improved parallel performance by reducing bank conflicts among multiple threads in the shared L2 cache. We created an open source BLAS project, OpenBLAS, to demonstrate the performance improvement on the Loongson 3A quad-core processor.
Keywords :
cache storage; floating point arithmetic; linear algebra; mathematics computing; multi-threading; multiprocessing systems; optimisation; parallel architectures; public domain software; shared memory systems; software performance evaluation; Chinese Academy of Sciences; Institute of Computing Technology; L1 data cache misses; Loongson 3A 128-bit memory; Loongson 3A architecture; Loongson 3A quadcore processor; OpenBLAS; bank conflicts; basic linear algebra subprograms; extension instructions; fundamental math library; general-purpose 64-bit MIPS64 quadcore processor; mainstream processor vendor; mainstream x86 CPU; model-driven level 3 BLAS performance optimization; multiple threads; open source BLAS project; parallel performance improvement; performance model; register blocking; scientific computing; single precision floating point SIMD instructions; single thread optimization; software prefetching; word length 128 bit; word length 64 bit; Kernel; Optimization; Pipelines; Prefetching; Registers; BLAS; Loongson 3A; MIPS64; Multi-core; Optimization;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Parallel and Distributed Systems (ICPADS), 2012 IEEE 18th International Conference on
Conference_Location :
Singapore
ISSN :
1521-9097
Print_ISBN :
978-1-4673-4565-1
Electronic_ISBN :
1521-9097
Type :
conf
DOI :
10.1109/ICPADS.2012.97
Filename :
6413635