Title :
Dymaxion: Optimizing memory access patterns for heterogeneous systems
Author :
Che, Shuai ; Sheaffer, Jeremy W. ; Skadron, Kevin
Author_Institution :
Dept. of Comput. Sci., Univ. of Virginia, Charlottesville, VA, USA
Abstract :
Graphics processors (GPUs) have emerged as an important platform for general-purpose computing. GPUs offer a large number of parallel cores and have access to high memory bandwidth; however, data structure layouts in GPU memory often lead to suboptimal performance for programs designed with a CPU memory interface (or no particular memory interface at all!) in mind. This implies that application performance is highly sensitive to irregularity in memory access patterns. This issue is all the more important due to the growing disparity between core and DRAM clocks; memory interfaces have increasingly become bottlenecks in computer systems. In this paper, we propose a simple API, Dymaxion, that allows programmers to optimize memory mappings to improve the efficiency of memory accesses on heterogeneous platforms. Use of Dymaxion requires only minimal modifications to existing CUDA programs. Our current framework extends NVIDIA's CUDA API with the addition of memory layout remapping and index transformation. We consider the overhead of layout remapping and effectively hide it through chunking and overlapping with PCI-E transfer. We present the implementation of Dymaxion and its optimizations and evaluate a variety of important memory access patterns. Using four case studies, we are able to achieve 3.3x speedup on GPU kernels and 20% overall performance improvement, including the PCI-E transfer, over the original CUDA implementations on an NVIDIA GTX 480 GPU. We also explore the importance of maintaining per-device data layouts and cross-device data mappings with a case study of concurrent CPU-GPU execution.
Keywords :
DRAM chips; application program interfaces; concurrency control; coprocessors; data structures; parallel architectures; API; CPU memory interface; CUDA program; DRAM clocks; Dymaxion; GPU memory; NVIDIA GTX 480 GPU; PCI-E transfer; concurrent CPU-GPU execution; cross-device data mappings; data structure layouts; general purpose computing; graphics processor; heterogeneous system; index transformation; memory access pattern optimisation; memory access patterns; memory layout remapping; parallel cores; per-device data layouts; Arrays; Graphics processing unit; Indexes; Instruction sets; Kernel; Layout; GPGPU; Heterogeneous Computer Architectures; Latency Hiding; Memory Access and Data Layout;
Conference_Title :
High Performance Computing, Networking, Storage and Analysis (SC), 2011 International Conference for
Conference_Location :
Seattle, WA
Electronic_ISBN :
978-1-4503-0771-0