Title :
New Scheduling Strategies and Hybrid Programming for a Parallel Right-looking Sparse LU Factorization Algorithm on Multicore Cluster Systems
Author :
Yamazaki, Ichitaro ; Li, Xiaoye S.
Author_Institution :
Comput. Res. Div., Lawrence Berkeley Nat. Lab., Berkeley, CA, USA
Abstract :
Parallel sparse LU factorization is a key computational kernel in the solution of large-scale linear systems of equations. In this paper, we propose two strategies to address some scalability issues of a factorization algorithm on modern HPC systems. The first strategy is at the algorithmic level: we schedule independent tasks as soon as possible to reduce the idle time and the critical path of the algorithm. We demonstrate using thousands of cores that our new scheduling strategy reduces the runtime nearly threefold compared with a state-of-the-art pipelined factorization algorithm. The second strategy is at both the programming and architecture levels: we incorporate lightweight OpenMP threads in each MPI process to reduce both the memory and time overheads of a pure MPI implementation on many-core NUMA architectures. Using this hybrid programming paradigm, we obtain a significant reduction in memory usage while achieving a parallel efficiency competitive with that of a pure MPI paradigm. As a result, in comparison to a pure MPI paradigm, which failed due to the per-core memory constraint, the hybrid paradigm could utilize more cores on each node and reduce the factorization time on the same number of nodes. We show extensive performance analysis of the new strategies using thousands of cores of two leading HPC systems, a Cray-XE6 and an IBM iDataPlex.
Keywords :
application program interfaces; mathematics computing; matrix decomposition; message passing; multiprocessing systems; parallel architectures; scheduling; Cray-XE6; HPC system; IBM iDataPlex; MPI process; algorithmic-level; architecture-level; computational kernel; factorization time reduction; hybrid programming paradigm; independent task scheduling; large-scale linear system; light-weight OpenMP thread; many-core NUMA architecture; memory overhead reduction; memory usage reduction; multicore cluster system; parallel efficiency; parallel right-looking sparse LU factorization algorithm; per-core memory constraint; pipelined factorization algorithm; programming-level; runtime reduction; scalability issue; time overhead reduction; Linear systems; Memory management; Multicore processing; Processor scheduling; Programming; Scheduling;
Conference_Title :
Parallel & Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International
Conference_Location :
Shanghai
Print_ISBN :
978-1-4673-0975-2
DOI :
10.1109/IPDPS.2012.63