Publication
Civil-Comp Proceedings
Paper

Mesh renumbering methods and performance with OpenMP/MPI in Code_Saturne

Abstract

The scale of computational fluid dynamics (CFD) simulations is increasing rapidly, driven by requirements for higher spatial resolution, varied turbulence models, and more detailed physics. Like many Navier-Stokes CFD tools, EDF's Code_Saturne, which is also one of the two CFD packages in the PRACE benchmark suite, is parallelized using domain partitioning and MPI. On large systems with thousands of compute nodes, even for simulations employing multi-billion-cell meshes, a pure MPI approach is not able to fully exploit the multiple levels of parallelism and the steady increase in the number of cores per processor. The most popular way to tackle this problem is to adopt a hybrid MPI/OpenMP programming model. Code_Saturne implements a general three-dimensional finite volume solver on conforming and non-conforming meshes. Its computation time is dominated by the linear equation solvers, mainly for the pressure, and to a lesser degree by gradient reconstruction. Thread-level parallelism is applied mainly to computational loops that iterate over the cells or faces of the cell-centred formulation. A general loop transformation was implemented to support a wide range of methods for controlling indirect memory-addressing conflicts between threads, while minimizing code changes. This paper presents different mesh renumbering algorithms for generating thread-safe loop subsets (a multipass approach based on METIS or SCOTCH partitioning or on space-filling Morton curves, and a Cuthill-McKee approach), while also exploiting communication overlap. Performance, scalability and comparison results are presented on an Intel x86 cluster (with three generations of Intel Xeon processors: Westmere, Ivy Bridge and Haswell) and on IBM Blue Gene/Q systems. A very significant part of the total execution time is spent in sparse matrix-vector products. It is shown that this kernel can behave like the STREAM benchmark and therefore depends on memory system performance, and that significant per-core performance degradation occurs depending on the number of cores used per node. Results for several Intel Xeon generations are provided, together with a hardware counter analysis.
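The loop transformation described in the abstract can be illustrated with a minimal C/OpenMP sketch. This is an illustrative sketch only, not Code_Saturne's actual code, and all identifiers (gather_face_flux, group_index, face_cells, cell_rhs) are assumed names: faces are renumbered into groups such that no two faces of a group share a cell, so each group's loop, despite its indirect cell updates, can be distributed over threads without write conflicts.

/*
 * Minimal sketch (illustrative only, not Code_Saturne source): a face loop
 * with indirect cell addressing, threaded with OpenMP after the faces have
 * been renumbered into groups in which no two faces share a cell.
 * group_index[g]..group_index[g+1] delimits the faces of group g;
 * face_cells[2*f] and face_cells[2*f + 1] are the cells adjacent to face f.
 */
void
gather_face_flux(int            n_groups,
                 const int     *group_index,  /* size n_groups + 1 */
                 const int     *face_cells,   /* size 2 * n_faces  */
                 const double  *face_flux,    /* size n_faces      */
                 double        *cell_rhs)     /* size n_cells      */
{
  for (int g = 0; g < n_groups; g++) {

    /* Within a group, the indirect updates touch disjoint cells,
       so no two threads write to the same cell_rhs entry. */
    #pragma omp parallel for
    for (int f = group_index[g]; f < group_index[g + 1]; f++) {
      int c0 = face_cells[2*f];
      int c1 = face_cells[2*f + 1];
      cell_rhs[c0] += face_flux[f];
      cell_rhs[c1] -= face_flux[f];
    }
  }
}

Serializing over groups while threading within each group is one of several conflict-control methods the renumbering algorithms in the paper are designed to enable.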
