Title :
Guided Region-Based GPU Scheduling: Utilizing Multi-thread Parallelism to Hide Memory Latency
Author :
Jianmin Chen ; Xi Tao ; Zhen Yang ; Jih-Kwon Peir ; Xiaoyuan Li ; Shih-Lien Lu
Author_Institution :
Dept. of CISE, Univ. of Florida, Gainesville, FL, USA
Abstract :
Modern general-purpose graphics processing units (GPGPUs) exploit parallelism in applications through massively parallel architectures and apply multithreading to hide instruction and memory latencies. Such architectures have become increasingly popular for parallel applications written in the CUDA and OpenCL programming languages. In this paper, we investigate thread scheduling algorithms on such highly threaded GPGPUs. Traditional round-robin scheduling schemes are inefficient at handling instruction execution and memory accesses with disparate latencies. We introduce a new GPGPU thread (warp) scheduling algorithm that enables a flexible round-robin distance to utilize multithread parallelism efficiently and uses program-guided priority shifts among concurrent threads (warps) to allow more overlap between short-latency compute instructions and long-latency memory accesses. Performance evaluations demonstrate that the new scheduling algorithm improves the execution times of a set of kernels by an average of 12%, with a 52% reduction in scheduler stall cycles, over the fine-granularity round-robin scheme. We also present a thorough evaluation of various thread scheduling algorithms with respect to the number of hardware threads, the scheduling overhead, and the global memory latency.
Keywords :
concurrency control; graphics processing units; multi-threading; parallel architectures; performance evaluation; processor scheduling; CUDA programming language; GPGPU thread scheduling algorithm; GPGPU warp scheduling algorithm; OpenCL programming language; concurrent thread; concurrent warp; fine-granularity round-robin scheme; general-purpose computation-on-graphics processing units; global memory latency; guided region-based GPU scheduling; hardware threads; instruction execution; instruction hiding; kernel execution time improvement; long-latency memory access; massively-parallel architecture; memory latency hiding; multithread parallelism; performance evaluation; program-guided priority shift; round-robin distance; scheduler stall cycle reduction; scheduling overhead; short-latency compute instructions; thread scheduling algorithms; Graphics processing units; Instruction sets; Kernel; Scheduling; Scheduling algorithms; CUDA; GPGPU; multi-thread; thread (warp) scheduling;
Conference_Title :
Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on
Conference_Location :
Boston, MA
Print_ISBN :
978-1-4673-6066-1
DOI :
10.1109/IPDPS.2013.95