Title :
(Mis)understanding the NUMA memory system performance of multithreaded workloads
Author :
Majo, Zoltan ; Gross, Thomas R.
Author_Institution :
Dept. of Comput. Sci., ETH Zurich, Zurich, Switzerland
Abstract :
An important aspect of workload characterization is understanding memory system performance (i.e., understanding a workload's interaction with the memory system). On systems with a non-uniform memory architecture (NUMA), performance critically depends on the distribution of data and computations. On systems with aggressive prefetcher units, the actual memory access patterns also have a large influence on performance. This paper describes an analysis of the memory system performance of multithreaded programs and shows that some programs are (unintentionally) structured so that they use the memory system of today's NUMA-multicores inefficiently: these programs exhibit program-level data sharing, a performance-limiting factor that makes data and computation distribution in NUMA systems difficult. Moreover, many programs have irregular memory access patterns that are hard for processor prefetcher units to predict. The memory system performance observed for a given program on a specific platform also depends on many algorithmic and implementation decisions. The paper shows that a set of simple algorithmic changes, coupled with commonly available OS functionality, suffices to eliminate data sharing and to regularize the memory access patterns for a subset of the PARSEC parallel benchmarks. These simple source-level changes result in performance improvements of up to 3.1X but, more importantly, lead to a fairer and more accurate performance evaluation on NUMA-multicore systems. They also illustrate the importance of carefully considering all details of algorithms and architectures to avoid drawing incorrect conclusions.
Keywords :
multi-threading; multiprocessing systems; operating systems (computers); parallel architectures; NUMA memory system performance; NUMA-multicores; OS functionality; PARSEC parallel benchmarks; computation distribution; data distribution; memory access patterns; multithreaded programs; multithreaded workloads; nonuniform memory architecture; workload characterization; Benchmark testing; Computer architecture; Microarchitecture; Prefetching; Process control; System performance;
Conference_Title :
Workload Characterization (IISWC), 2013 IEEE International Symposium on
Conference_Location :
Portland, OR
Print_ISBN :
978-1-4799-0553-9
DOI :
10.1109/IISWC.2013.6704666