Abstract:
Following current trends in IC design technology, modern GPUs integrate an ever larger number of processing cores, and the speed gap between the processors and the memory system continues to widen. As the number of cores increases, the memory bandwidth available per core decreases correspondingly. Memory access has therefore become one of the most critical performance bottlenecks. This paper analyzes the impact of the memory system on GPU performance and scalability, using several scientific applications and a cycle-accurate simulator. We make two observations: (1) memory bandwidth has a relatively greater impact on performance than memory latency, because latency can be largely hidden by the large number of concurrently executing threads supported by modern GPU architectures; and (2) an examination of performance scalability across varying numbers of active cores shows that using the maximum number of hardware-supported cores may not yield better performance, especially for memory-intensive applications. Finally, we suggest that a more power-efficient way to exploit the GPU is to throttle concurrency judiciously according to the application's memory usage.
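
The scalability observation can be illustrated with a simple roofline-style bound. The sketch below is not the cycle-accurate simulator used in this paper. It assumes a hypothetical GPU in which every active core has the same peak compute throughput and all cores share a fixed total memory bandwidth. The function names, parameters, and numbers are illustrative assumptions only.

```python
import math


def attainable_throughput(active_cores, flops_per_core,
                          total_bandwidth, arithmetic_intensity):
    """Roofline-style upper bound on throughput (FLOP/s).

    All parameters are hypothetical model inputs, not measured values.
    arithmetic_intensity is the number of FLOPs performed per byte
    moved to or from memory.
    """
    compute_bound = active_cores * flops_per_core
    # Shared bandwidth caps throughput regardless of the core count.
    memory_bound = total_bandwidth * arithmetic_intensity
    return min(compute_bound, memory_bound)


def saturation_cores(flops_per_core, total_bandwidth, arithmetic_intensity):
    """Smallest core count at which shared bandwidth becomes the limit."""
    return math.ceil(total_bandwidth * arithmetic_intensity / flops_per_core)


if __name__ == "__main__":
    # Illustrative numbers: a memory-intensive kernel (0.5 FLOP per byte).
    for cores in (2, 4, 8, 16):
        print(cores, attainable_throughput(cores, 10, 100, 0.5))
```

Under these assumptions, activating more cores than `saturation_cores` returns adds no throughput for a memory-intensive kernel, which is the situation in which concurrency throttling can save power without losing performance.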