DocumentCode :
692926
Title :
Parallel reduction to Hessenberg form with Algorithm-Based Fault Tolerance
Author :
Yulu Jia ; Bosilca, George ; Luszczek, Piotr ; Dongarra, Jack J.
Author_Institution :
Univ. of Tennessee, Knoxville, TN, USA
fYear :
2013
fDate :
17-22 Nov. 2013
Firstpage :
1
Lastpage :
11
Abstract :
This paper studies the resilience of a two-sided factorization and presents a generic algorithm-based approach capable of making two-sided factorizations resilient. We establish the theoretical proof of the correctness and the numerical stability of the approach in the context of a Hessenberg Reduction (HR) and present the scalability and performance results of a practical implementation. Our method is a hybrid algorithm combining an Algorithm-Based Fault Tolerance (ABFT) technique with diskless checkpointing to fully protect the data. We protect the trailing and the initial part of the matrix with checksums, and protect finished panels in the panel scope with diskless checkpoints. Compared with the original HR (the ScaLAPACK PDGEHRD routine), our fault-tolerant algorithm introduces very little overhead and maintains the same level of scalability. We prove that the overhead shows a decreasing trend as the size of the matrix or the size of the process grid increases.
Keywords :
checkpointing; mathematics computing; matrix decomposition; numerical stability; parallel algorithms; software fault tolerance; ABFT technique; HR; Hessenberg reduction; algorithm-based fault tolerance; data protection; diskless checkpointing; generic algorithm-based approach; hybrid algorithm; numerical stability; parallel reduction; two-sided factorization; Algorithm design and analysis; Checkpointing; Eigenvalues and eigenfunctions; Fault tolerance; Fault tolerant systems; Libraries; Prediction algorithms; Algorithm-based fault tolerance; Dense linear algebra; Hessenberg reduction; Parallel numerical libraries; ScaLAPACK;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
High Performance Computing, Networking, Storage and Analysis (SC), 2013 International Conference for
Conference_Location :
Denver, CO
Print_ISBN :
978-1-4503-2378-9
Type :
conf
DOI :
10.1145/2503210.2503249
Filename :
6877521