Title :
Model-based, memory-centric performance and power optimization on NUMA multiprocessors
Author :
Su, Chunyi ; Li, Dong ; Nikolopoulos, Dimitrios S. ; Cameron, Kirk W. ; de Supinski, Bronis R. ; Leon, Edgar A.
Author_Institution :
Dept. of Comput. Sci., Virginia Tech, Blacksburg, VA, USA
Abstract :
Non-Uniform Memory Access (NUMA) architectures are ubiquitous in HPC systems. NUMA, along with other factors including socket layout, data placement, and memory contention, significantly increases the search space for finding an optimal mapping of applications to NUMA systems. This search space may be intractable for online optimization and challenging for efficient offline search. This paper presents DyNUMA, a framework for dynamic optimization of programs on NUMA architectures. DyNUMA uses simple, memory-centric performance and energy models with non-linear terms to capture the complex and interacting effects of system layout, program concurrency, data placement, and memory controller contention. DyNUMA leverages an artificial neural network (ANN) with input, output, and intermediate layers that emulate program threads, memory controllers, processor cores, and their interactions. Using an ANN in conjunction with critical path analysis, DyNUMA autonomously optimizes programs for performance or energy-efficiency metrics. We used DyNUMA on a variety of benchmarks from the NPB and ASC Sequoia suites on three different architectures (a 16-core AMD Barcelona system, a 32-core AMD Magny-Cours system, and a 64-core Tilera TilePro64 system). Our results show that DyNUMA achieves, on average, an 8.7% improvement in performance (12.9% in the best case), a 16% improvement in Energy-Delay (30.6% in the best case), and a 9.1% improvement in MFLOPS/Watt (10.7% in the best case) compared to the default Linux scheduling.
Keywords :
circuit optimisation; concurrency control; energy conservation; memory architecture; microprocessor chips; multiprocessing systems; neural nets; parallel processing; parallel programming; performance evaluation; power aware computing; 16-core AMD Barcelona system; 32-core AMD Magny-Cours system; 64-core Tilera TilePro64 system; ANN; ASC Sequoia suites; DyNUMA; HPC systems; NPB suite; NUMA multiprocessors; artificial neural network; critical path analysis; data placement; dynamic program optimization; efficient offline search; energy-delay; energy-efficiency metrics; Linux scheduling; memory contention; memory controller contention; memory controllers; model-based memory-centric performance optimization; model-based memory-centric power optimization; nonlinear terms; nonuniform memory access architectures; online optimization; optimal mapping; processor cores; program concurrency; program threads; search space; socket layout; Artificial neural networks; Concurrent computing; Hardware; Instruction sets; Measurement; Optimization; Sockets
Conference_Title :
Workload Characterization (IISWC), 2012 IEEE International Symposium on
Conference_Location :
La Jolla, CA
Print_ISBN :
978-1-4673-4531-6
DOI :
10.1109/IISWC.2012.6402921