DocumentCode :
3206319
Title :
A Communication-Avoiding, Hybrid-Parallel, Rank-Revealing Orthogonalization Method
Author :
Hoemmen, Mark
Author_Institution :
Scalable Algorithms Dept., Sandia Nat. Labs., Albuquerque, NM, USA
fYear :
2011
fDate :
16-20 May 2011
Firstpage :
966
Lastpage :
977
Abstract :
Orthogonalization consumes much of the run time of many iterative methods for solving sparse linear systems and eigenvalue problems. Commonly used algorithms, such as variants of Gram-Schmidt or Householder QR, have performance dominated by communication. Here, "communication" includes both data movement between the CPU and memory, and messages between processors in parallel. Our Tall Skinny QR (TSQR) family of algorithms requires asymptotically fewer messages between processors and data movement between CPU and memory than typical orthogonalization methods, yet achieves the same accuracy as Householder QR factorization. Furthermore, in block orthogonalizations, TSQR is faster and more accurate than existing approaches for orthogonalizing the vectors within each block ("normalization"). TSQR\´s rank-revealing capability also makes it useful for detecting deflation in block iterative methods, for which existing approaches sacrifice performance, accuracy, or both. We have implemented a version of TSQR that exploits both distributed-memory and shared-memory parallelism, and supports real and complex arithmetic. Our implementation is optimized for the case of orthogonalizing a small number (5 -- 20) of very long vectors. The shared-memory parallel component uses Intel\´s Threading Building Blocks, though its modular design supports other shared-memory programming models as well, including computation on the GPU. Our implementation achieves speedups of 2 times or more over competing orthogonalizations. It is available now in the development branch of the Trilinos software package, and will be included in the 10.8 release.
Keywords :
distributed memory systems; eigenvalues and eigenfunctions; iterative methods; mathematics computing; matrix decomposition; software packages; CPU; GPU; Gram-Schmidt; Householder QR factorization; Intel threading building blocks; TSQR; Tall Skinny QR; Trilinos software package; block iterative methods; communication-avoiding orthogonalization method; data movement; deflation detection; distributed-memory parallelism; eigenvalue problems; hybrid-parallel orthogonalization method; rank-revealing orthogonalization method; shared-memory parallelism; shared-memory programming models; sparse linear systems; Accuracy; Instruction sets; Iterative methods; Kernel; Parallel processing; Q factor; Tuning;
fLanguage :
English
Publisher :
ieee
Conference_Titel :
Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International
Conference_Location :
Anchorage, AK
ISSN :
1530-2075
Print_ISBN :
978-1-61284-372-8
Electronic_ISBN :
1530-2075
Type :
conf
DOI :
10.1109/IPDPS.2011.93
Filename :
6012905
Link To Document :
بازگشت