DocumentCode
1650475
Title
Reproducible Tall-Skinny QR
Author
Demmel, James ; Hong Diep Nguyen
Author_Institution
Univ. of California, Berkeley, Berkeley, CA, USA
fYear
2015
Firstpage
152
Lastpage
159
Abstract
Reproducibility is the ability to obtain bitwise identical results from different runs of the same program on the same input data, regardless of the available computing resources, or how they are scheduled. Recently, techniques have been proposed to attain reproducibility for BLAS operations, all of which rely on reproducibly computing the floating-point sum and dot product. Nonetheless, a reproducible BLAS library does not automatically translate into a reproducible higher-level linear algebra library, especially when communication is optimized. For instance, for the QR factorization, conventional algorithms such as Householder transformation or Gram-Schmidt process can be used to reproducibly factorize a floating-point matrix by fixing the high-level order of computation, for example column-by-column from left to right, and by using reproducible versions of level-1 BLAS operations such as dot product and 2-norm. In a massively parallel environment, those algorithms have high communication cost due to the need for synchronization after each step. The Tall-Skinny QR algorithm obtains much better performance in massively parallel environments by reducing the number of messages by a factor of n to O(log(P)) where P is the processor count, by reducing the number of reduction operations to O(1). Those reduction operations however are highly dependent on the network topology, in particular the number of computing nodes, and therefore are difficult to implement reproducibly and with reasonable performance. In this paper we present a new technique to reproducibly compute a QR factorization for a tall skinny matrix, which is based on the Cholesky QR algorithm to attain reproducibility as well as to improve communication cost, and the iterative refinement technique to guarantee the accuracy of the computed results. Our technique exhibits strong scalability in massively parallel environments, and at the same time can provide results of almost the same accuracy as the convent- onal Householder QR algorithm unless the matrix is extremely badly conditioned, in which case a warning can be given. Initial experimental results in Matlab show that for not too ill-conditioned matrices whose condition number is smaller than sqrt(1/e) where e is the machine epsilon, our technique runs less than 4 times slower than the built-in Matlab qr() function, and always computes numerically stable results in terms of column-wise relative error.
Keywords
iterative methods; mathematics computing; matrix decomposition; parallel processing; BLAS operation; Cholesky QR algorithm; Gram-Schmidt process; Matlab; QR factorization; column-wise relative error; dot product; floating-point matrix factorization; floating-point sum; householder transformation; iterative refinement technique; level-1 BLAS operations; massively parallel environment; network topology; reproducible BLAS library; reproducible higher-level linear algebra library; reproducible tall-skinny QR algorithm; Floating-Point Computation; Linear Algebra; Reproducibility;
fLanguage
English
Publisher
ieee
Conference_Titel
Computer Arithmetic (ARITH), 2015 IEEE 22nd Symposium on
Conference_Location
Lyon
ISSN
1063-6889
Print_ISBN
978-1-4799-8663-7
Type
conf
DOI
10.1109/ARITH.2015.28
Filename
7203810
Link To Document