DocumentCode :
2957864
Title :
A Comprehensive Study of Task Coalescing for Selecting Parallelism Granularity in a Two-Stage Bidiagonal Reduction
Author :
Haidar, Azzam ; Ltaief, Hatem ; Luszczek, Piotr ; Dongarra, Jack
Author_Institution :
Innovative Comput. Lab., Univ. of Tennessee, Knoxville, TN, USA
fYear :
2012
fDate :
21-25 May 2012
Firstpage :
25
Lastpage :
35
Abstract :
We present new high performance numerical kernels combined with advanced optimization techniques that significantly increase the performance of parallel bidiagonal reduction. Our approach is based on developing efficient fine-grained computational tasks as well as reducing overheads associated with their high-level scheduling during the so-called bulge chasing procedure that is an essential phase of a scalable bidiagonalization procedure. In essence, we coalesce multiple tasks in a way that reduces the time needed to switch execution context between the scheduler and useful computational tasks. At the same time, we maintain the crucial information about the tasks and their data dependencies between the coalescing groups. This is the necessary condition to preserve numerical correctness of the computation. We show our annihilation strategy based on multiple applications of single orthogonal reflectors. Despite non-trivial characteristics in computational complexity and memory access patterns, our optimization approach smoothly applies to the annihilation scenario. The coalescing positively influences another equally important aspect of the bulge chasing stage: the memory reuse. For the tasks within the coalescing groups, the data is retained in high levels of the cache hierarchy and, as a consequence, operations that are normally memory-bound increase their ratio of computation to off-chip communication and become compute-bound which renders them amenable to efficient execution on multicore architectures. The performance for the new two-stage bidiagonal reduction is staggering. Our implementation results in up to 50-fold and 12-fold improvement (~130 Gflop/s) compared to the equivalent routines from LAPACK V3.2 and Intel MKL V10.3, respectively, on an eight socket hexa-core AMD Opteron multicore shared-memory system with a matrix size of 24000 × 24000. Last but not least, we provide a comprehensive study on the impact of the coalescing group size in terms of- cache utilization and power consumption in the context of this new two-stage bidiagonal reduction.
Keywords :
cache storage; computational complexity; mathematics computing; matrix algebra; optimisation; scheduling; shared memory systems; singular value decomposition; advanced optimization technique; annihilation scenario; annihilation strategy; bulge chasing procedure; bulge chasing stage; cache hierarchy; cache utilization; coalescing group; computational complexity; data dependency; fine-grained computational task; hexa-core AMD Opteron multicore shared-memory system; high performance numerical kernel; high-level scheduling; memory access pattern; memory reusing; multicore architecture; numerical correctness; off-chip communication; orthogonal reflector; parallel bidiagonal reduction; parallelism granularity selection; power consumption; rectangular dense matrix; scalable bidiagonalization procedure; singular value decomposition; task coalescing; two-stage bidiagonal reduction; Context; Eigenvalues and eigenfunctions; Kernel; Layout; Multicore processing; Symmetric matrices; Tiles; Bidiagonal Reduction; Bulge Chasing; Dynamic Scheduling; Granularity Analysis; Power Profiling; Tile Algorithms; Two-Stage Approach;
fLanguage :
English
Publisher :
ieee
Conference_Titel :
Parallel & Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International
Conference_Location :
Shanghai
ISSN :
1530-2075
Print_ISBN :
978-1-4673-0975-2
Type :
conf
DOI :
10.1109/IPDPS.2012.13
Filename :
6267821
Link To Document :
بازگشت