Title :
NUMA Aware Iterative Stencil Computations on Many-Core Systems
Author :
Shaheen, Mohammed ; Strzodka, Robert
Author_Institution :
Integrative Sci. Comput. Group, Max Planck Inst. Inf., Saarbrücken, Germany
Abstract :
Temporal blocking in iterative stencil computations makes it possible to surpass the performance bound that peak system bandwidth imposes on a single stencil computation. However, the effectiveness of temporal blocking depends strongly on the tiling scheme, which must balance the conflicting goals of spatio-temporal data locality, regular memory access patterns, parallelization into many independent tasks, and data-to-core affinity for NUMA-aware data distribution. Despite the prevalence of cache coherent non-uniform memory access (ccNUMA) in today's many-core systems, this last aspect has been largely ignored in the development of temporal blocking algorithms. Building upon previous cache-aware [1] and cache-oblivious [2] schemes, this paper develops their NUMA-aware variants and explains why treating data-to-core affinity as an equally important goal also necessitates new tiling and parallelization strategies. Results are presented on an 8-socket dual-core system and a 4-socket oct-core system and compared against an optimized naive scheme, various peak performance characteristics, and related schemes from the literature.
Keywords :
cache storage; iterative methods; multiprocessing systems; 4 socket oct-core system; 8 socket dual-core system; NUMA-aware data distribution; NUMA-aware variant; cache coherent nonuniform memory access; cache-aware scheme; cache-oblivious scheme; contradicting goal; data-to-core affinity; independent task; many-core system; nonuniform memory access aware iterative stencil computation; parallelization strategy; regular memory access pattern; spatio-temporal data locality; temporal blocking algorithm; tiling scheme; tiling strategy; Bandwidth; Cats; Instruction sets; Kernel; Scalability; Synchronization; Tiles; affinity; cache-aware; cache-oblivious; parallelism and locality; stencil computation; temporal blocking
Conference_Title :
Parallel & Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International
Conference_Location :
Shanghai
Print_ISBN :
978-1-4673-0975-2
DOI :
10.1109/IPDPS.2012.50