Accelerating subsurface transport simulation on heterogeneous clusters

Author

Villa, Oreste ; Gawande, Nitin ; Tumeo, Antonino

Author_Institution

NVIDIA, Santa Clara, CA, USA

fYear

2013

fDate

23-27 Sept. 2013

Firstpage

Lastpage

Abstract

Reactive transport numerical models simulate chemical and microbiological reactions that occur along a flow path. These models have to compute reactions for a large number of locations. They solve the set of ordinary differential equations (ODEs) that describes the reaction at each location with the Newton-Raphson technique. This technique involves computing a Jacobian matrix and a residual vector for each set of equations, and then iteratively solving the linearized system by Gaussian elimination and LU decomposition until convergence. STOMP, a well-known subsurface flow simulation tool, employs matrices on the order of 100×100 elements and, for numerical accuracy, LU factorization with full pivoting instead of the faster partial pivoting. Modern high-performance computing systems are heterogeneous machines whose nodes integrate both CPUs and GPUs and expose unprecedented amounts of parallelism. To exploit all of their computational power, applications must use both types of processing elements. For subsurface flow simulation, this mainly requires implementing efficient batched LU-based solvers and finding efficient ways to balance the load among the different processors of the system. In this paper we discuss two approaches that allow scaling STOMP's performance on heterogeneous clusters. We first identify the challenges in implementing batched LU-based solvers for small matrices on GPUs, and propose an implementation that fulfills STOMP's requirements. We compare this implementation to other existing solutions. Then, we combine the batched GPU solver with an OpenMP-based CPU solver, and present an adaptive load balancer that dynamically distributes the linear systems to be solved between the two components inside a node. We show how these approaches, integrated into the full application, provide speedups of 6 to 7 times on large problems, executed on up to 16 nodes of a cluster with two AMD Opteron 6272 processors and one Tesla M2090 GPU per node.
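
The per-location solve described in the abstract (Newton-Raphson iterations, each solving a small dense linear system with full pivoting) can be illustrated with a minimal sketch. This is not STOMP's code or the paper's GPU solver: it is plain, unbatched Python, and the names `solve_full_pivot` and `newton_raphson`, the tolerance, and the iteration limit are all hypothetical choices made for illustration.

```python
def solve_full_pivot(a, b):
    """Solve a x = b by Gaussian elimination with full (complete) pivoting.

    Illustrative sketch only; `a` is a list of rows, `b` a list of floats.
    """
    n = len(a)
    m = [row[:] for row in a]      # work on copies
    rhs = list(b)
    col_order = list(range(n))     # column position -> original unknown index

    for k in range(n):
        # Full pivoting: pick the largest entry of the trailing submatrix.
        p, q = max(
            ((i, j) for i in range(k, n) for j in range(k, n)),
            key=lambda ij: abs(m[ij[0]][ij[1]]),
        )
        if m[p][q] == 0.0:
            raise ValueError("matrix is singular")
        # Row swap (partial pivoting would stop here and search one column only).
        m[k], m[p] = m[p], m[k]
        rhs[k], rhs[p] = rhs[p], rhs[k]
        # Column swap, recorded so the unknowns can be reordered at the end.
        for row in m:
            row[k], row[q] = row[q], row[k]
        col_order[k], col_order[q] = col_order[q], col_order[k]
        # Eliminate the entries below the pivot.
        for i in range(k + 1, n):
            f = m[i][k] / m[k][k]
            for j in range(k, n):
                m[i][j] -= f * m[k][j]
            rhs[i] -= f * rhs[k]

    # Back substitution on the upper-triangular system.
    y = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * y[j] for j in range(i + 1, n))
        y[i] = (rhs[i] - s) / m[i][i]

    # Undo the column permutation.
    x = [0.0] * n
    for k in range(n):
        x[col_order[k]] = y[k]
    return x


def newton_raphson(residual, jacobian, x0, tol=1e-10, max_iter=50):
    """Iterate x <- x + dx, where J(x) dx = -r(x), until max|r(x)| < tol."""
    x = list(x0)
    for _ in range(max_iter):
        r = residual(x)
        if max(abs(v) for v in r) < tol:
            return x
        dx = solve_full_pivot(jacobian(x), [-v for v in r])
        x = [xi + di for xi, di in zip(x, dx)]
    raise RuntimeError("Newton-Raphson did not converge")


# Usage on a made-up 2x2 nonlinear system (not taken from the paper):
#   x^2 + y^2 = 4,  x*y = 1
def residual(v):
    return [v[0] ** 2 + v[1] ** 2 - 4.0, v[0] * v[1] - 1.0]


def jacobian(v):
    return [[2.0 * v[0], 2.0 * v[1]], [v[1], v[0]]]


root = newton_raphson(residual, jacobian, [2.0, 0.5])
```

Full pivoting searches the whole trailing submatrix at every step, which costs more than the single-column search of partial pivoting; the abstract notes that STOMP accepts that cost for numerical accuracy. In a batched setting, one such solve would run independently for each location.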

Keywords

Jacobian matrices; Newton-Raphson method; computational fluid dynamics; differential equations; flow simulation; graphics processing units; mechanical engineering computing; parallel processing; resource allocation; AMD Opteron 6272; GPU; Gaussian elimination; Jacobian matrix; LU decomposition; Newton-Raphson technique; ODE; OpenMP-based CPU solver; STOMP tool; Tesla M2090; adaptive load balancer; batched LU-based solvers; chemical reactions; full pivoting; graphics processing unit; heterogeneous clusters; heterogeneous machines; high performance computing systems; microbiological reactions; ordinary differential equations; parallelism; reactive transport numerical models; residual vector; subsurface flow simulation tool; subsurface transport simulation; Convergence; Jacobian matrices; Laboratories; Linear systems; Matrix decomposition; Numerical models; Sparse matrices;

fLanguage

English

Publisher

IEEE

Conference_Titel

Cluster Computing (CLUSTER), 2013 IEEE International Conference on

Conference_Location

Indianapolis, IN

Type

conf

DOI

10.1109/CLUSTER.2013.6702656

Filename

6702656

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=668153