Title :
GMProf: A low-overhead, fine-grained profiling approach for GPU programs
Author :
Mai Zheng ; Ravi, Vignesh T. ; Wenjing Ma ; Feng Qin ; Agrawal, Gagan
Author_Institution :
Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH, USA
Abstract :
Driven by the cost-effectiveness and the power-efficiency, GPUs are being increasingly used to accelerate computations in many domains. However, developing highly efficient GPU implementations requires a lot of expertise and effort. Thus, tool support for tuning GPU programs is urgently needed, and more specifically, low-overhead mechanisms for collecting fine-grained runtime information are critically required. Unfortunately, profiling tools and mechanisms available today either collect very coarse-grained information, or have prohibitive overheads. This paper presents a low-overhead and fine-grained profiling technique developed specifically for GPUs, which we refer to as GMProf. GMProf uses two ideas to help reduce the overheads of collecting fine-grained information. The first idea involves exploiting a number of GPU architectural features to collect reasonably accurate information very efficiently, and the second idea is to use simple static analysis methods to reduce the overhead of runtime profiling. The specific implementation of GMProf we report in this paper focuses on shared memory usage. Particularly, we help programmers understand (1) which locations in shared memory are infrequently accessed? and (2) which data elements in device memory are frequently accessed? We have evaluated GMProf using six popular GPU kernels with different characteristics. Our experimental results show that GMProf, with all optimizations, incurs a moderate overhead, e.g., 1.36 times on average for shared memory profiling. Furthermore, for three of the six evaluated kernels, GMProf verified that shared memory is effectively used, and for the remaining three kernels, it not only helped accurately identify the inefficient use of shared memory, but also helped tune the implementations. The resulting tuned implementations had a speedup of 15.18 times on average.
Keywords :
data flow analysis; graphics processing units; shared memory systems; GMProf; GPU architectural features; GPU kernels; coarse-grained information; data elements; device memory; fine-grained profiling approach; fine-grained profiling technique; fine-grained runtime information collecting; highly efficient GPU implementations; low-overhead mechanisms; low-overhead profiling approach; low-overhead profiling technique; optimizations; power-efficiency; profiling tools; runtime profiling; shared memory profiling; shared memory usage; static analysis methods; tool support; tuning GPU programs;
Conference_Titel :
High Performance Computing (HiPC), 2012 19th International Conference on
Conference_Location :
Pune
Print_ISBN :
978-1-4673-2372-7
Electronic_ISBN :
978-1-4673-2370-3
DOI :
10.1109/HiPC.2012.6507475