The following sections contain best practices for tuning the performance of PyFR. Note, however, that it is typically not worth pursuing the advice in this section until a simulation is working acceptably and generating the desired results.
If libxsmm is not available then PyFR will make use of GiMMiK for all matrix-matrix multiplications. Although functional, the performance is typically sub-par compared with that of libxsmm. As such libxsmm is highly recommended.
When running on an AVX-512 capable CPU, Clang and GCC will, by default, only make use of 256-bit vectors. Given that the kernels in PyFR benefit meaningfully from longer vectors, it is desirable to override this behaviour. This can be accomplished through the cflags key as:
    [backend-openmp]
    cflags = -mprefer-vector-width=512
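Before overriding the vector width it can be useful to confirm that the CPU actually reports AVX-512 support. The following is a minimal sketch, assuming a Linux system whose /proc/cpuinfo contains an x86-style flags line; the helper name has_avx512 is hypothetical and is not part of PyFR.

```python
def has_avx512(cpuinfo_text):
    """Return True if a 'flags' line in /proc/cpuinfo-style text lists avx512f.

    Assumption: the text follows the Linux x86 layout, where each logical
    CPU has a line of the form 'flags : fpu vme ... avx512f ...'.
    """
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # Compare whole tokens so that e.g. 'avx512f' is not matched
        # inside some longer, unrelated flag name.
        if key.strip() == "flags" and "avx512f" in value.split():
            return True
    return False
```

On a Linux machine the helper can be applied to the contents of /proc/cpuinfo, for example has_avx512(open("/proc/cpuinfo").read()). On other platforms the file or the flags line may be absent, in which case the function simply returns False.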
Cores vs. threads
PyFR does not typically derive any benefit from SMT. As such the number of OpenMP threads should be chosen to be equal to the number of physical cores.
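Since the operating system usually reports logical rather than physical cores, the physical core count sometimes has to be worked out separately. The sketch below is one way to do so, assuming a Linux x86 system where /proc/cpuinfo lists a physical id and a core id for every logical CPU; count_physical_cores is a hypothetical helper and not something PyFR provides.

```python
def count_physical_cores(cpuinfo_text):
    """Count distinct (socket, core) pairs in /proc/cpuinfo-style text.

    Assumption: each logical CPU block lists 'physical id' before
    'core id', as is the case on Linux x86 systems. SMT siblings share
    the same pair and are therefore only counted once.
    """
    cores = set()
    physical_id = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key = key.strip()
        if key == "physical id":
            physical_id = value.strip()
        elif key == "core id":
            cores.add((physical_id, value.strip()))
    return len(cores)
```

The resulting number is a reasonable value for the standard OpenMP environment variable OMP_NUM_THREADS. If the expected fields are missing, as may happen on non-x86 platforms, the function returns 0 and the count should be obtained by other means.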
MPI processes vs. OpenMP threads
When using the OpenMP backend it is recommended to employ one MPI rank per NUMA zone. For most systems each socket represents its own NUMA zone. Thus, on a two socket system it is suggested to run PyFR with two MPI ranks, with each process being bound to a single socket. The specifics of how to accomplish this depend on both the job scheduler and MPI distribution.
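The arithmetic behind this recommendation can be sketched as follows. The function rank_layout is hypothetical, and it assumes that the physical cores are divided evenly between NUMA zones, which should be confirmed for the system in question.

```python
def rank_layout(numa_zones, physical_cores):
    """Return (mpi_ranks, omp_threads_per_rank) for one rank per NUMA zone.

    Assumption: every NUMA zone holds the same number of physical cores.
    """
    if numa_zones < 1:
        raise ValueError("at least one NUMA zone is required")
    if physical_cores % numa_zones != 0:
        raise ValueError("cores do not divide evenly between NUMA zones")
    return numa_zones, physical_cores // numa_zones
```

For a two socket node with 64 physical cores in total, rank_layout(2, 64) gives two MPI ranks with 32 OpenMP threads each. Binding each rank to its socket is still left to the job scheduler or MPI launcher.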
PyFR is capable of taking advantage of CUDA-aware MPI. This enables CUDA device pointers to be passed directly to MPI routines. Under the right circumstances this can result in improved performance for simulations which are near the strong scaling limit. Assuming mpi4py has been built against an MPI distribution which is CUDA-aware, this functionality can be enabled through the mpi-type key as:
    [backend-cuda]
    mpi-type = cuda-aware
Note that if UCX is used as a transport, as is the case for recent builds of OpenMPI, it may be necessary to set:
$ export UCX_MEMTYPE_CACHE=n
METIS vs SCOTCH
The partitioning module in PyFR includes support for both METIS and SCOTCH. Both usually result in high-quality decompositions. However, for long-running simulations on complex geometries it may be worth partitioning a grid with both and observing which decomposition performs best.
When running PyFR in parallel on mixed element grids it is necessary to take some additional care when partitioning the grid. A good domain decomposition is one where each partition contains the same amount of computational work. For grids with a single element type the amount of computational work is very well approximated by the number of elements assigned to a partition. Thus the goal is simply to ensure that all of the partitions have roughly the same number of elements. However, when considering mixed grids this relationship begins to break down since the computational cost of one element type can be appreciably more than that of another.
Thus in order to obtain a good decomposition it is necessary to assign a weight to each type of element in the domain. Element types which are more computationally intensive should be assigned a larger weight than those that are less intensive. Unfortunately, the relative cost of different element types depends on a variety of factors, including:
The polynomial order.
Whether anti-aliasing is enabled in the simulation, and if so, to what extent.
The hardware which the simulation will be run on.
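To make the notion of a balanced decomposition concrete, the sketch below estimates the work in each partition as a weighted sum of its element counts and reports how far the busiest partition is above the mean. The function names and the weights used in the usage note are purely illustrative and would need to come from profiling on the target hardware.

```python
def partition_loads(partitions, weights):
    """Return the weighted work of each partition.

    partitions: list of dicts mapping element type -> element count.
    weights:    dict mapping element type -> relative cost per element.
    """
    return [
        sum(weights[etype] * count for etype, count in part.items())
        for part in partitions
    ]


def imbalance(loads):
    """Ratio of the largest load to the mean load (1.0 is perfectly balanced)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean
```

As an illustration, two partitions holding 100 quadrilaterals and 100 triangles respectively have equal element counts, but with hypothetical weights of 3 and 2 their loads are 300 and 200, so the first partition carries 20% more work than the mean.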
Weights can be specified when partitioning the mesh as
-e shape:weight. For example, if on a particular system a
quadrilateral is found to be 50% more expensive than a triangle this
can be specified as:
pyfr partition -e quad:3 -e tri:2 ...
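Measured cost ratios are rarely whole numbers, so a small helper can turn them into integer weights of the kind used in the example above. This is a sketch only: integer_weights and partition_args are hypothetical helpers, the choice to round to small integers is an assumption made here for readability, and Python 3.9 or newer is required for the multi-argument math.gcd and math.lcm.

```python
from fractions import Fraction
from math import gcd, lcm


def integer_weights(relative_costs, max_denominator=10):
    """Convert relative per-element costs into small integer weights.

    Each cost is approximated by a fraction whose denominator does not
    exceed max_denominator; all fractions are then scaled by a common
    factor so that every weight is an integer.
    """
    fracs = {
        shape: Fraction(cost).limit_denominator(max_denominator)
        for shape, cost in relative_costs.items()
    }
    scale = lcm(*(f.denominator for f in fracs.values()))
    ints = {shape: int(f * scale) for shape, f in fracs.items()}
    # Remove any common factor so the weights are as small as possible.
    common = gcd(*ints.values())
    return {shape: w // common for shape, w in ints.items()}


def partition_args(weights):
    """Format weights as '-e shape:weight' arguments."""
    return " ".join(f"-e {shape}:{w}" for shape, w in weights.items())
```

With a quadrilateral costing 1.5 times as much as a triangle, integer_weights({"quad": 1.5, "tri": 1.0}) yields weights of 3 and 2, and partition_args formats these as the -e quad:3 -e tri:2 arguments shown above.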
If precise profiling data is not available regarding the performance of each element type in a given configuration a helpful rule of thumb is to under-weight the dominant element type in the domain. For example, if a domain is 90% tetrahedra and 10% prisms then, absent any additional information about the relative performance of tetrahedra and prisms, a safe choice is to assume the prisms are appreciably more expensive than the tetrahedra.
PyFR incorporates support for parallel file I/O via HDF5 and will use it automatically where available. However, for this to work several prerequisites must be satisfied:
HDF5 must be explicitly compiled with support for parallel I/O.
The mpi4py Python module must be compiled against the same MPI distribution as HDF5. A version mismatch here can result in subtle and difficult to diagnose errors.
The h5py Python module must be built with support for parallel I/O.
After completing this process it is highly recommended to verify everything is working by trying the h5py parallel hdf5 example.
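As a quick preliminary check before running the full example, h5py can report whether it was built with MPI support through h5py.get_config().mpi. The wrapper below is a minimal sketch; the function name is hypothetical and a positive result does not replace the end-to-end test, since it says nothing about whether mpi4py and HDF5 share the same MPI distribution.

```python
def h5py_parallel_status():
    """Report whether the installed h5py was built with MPI support.

    Returns True or False according to h5py's own build configuration,
    or None if h5py cannot be imported at all.
    """
    try:
        import h5py
    except ImportError:
        return None
    return bool(h5py.get_config().mpi)
```

A result of False indicates that h5py needs to be rebuilt against a parallel HDF5 library, whereas None means h5py is not installed in the current environment.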
The start-up time required by PyFR can be reduced by ensuring that Python is compiled from source with profile-guided optimisations (PGO), which can be enabled by passing --enable-optimizations to the configure script.
It is also important that NumPy be configured to use an optimized BLAS/LAPACK distribution. Further details can be found in the NumPy building from source guide.
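Whether the running interpreter was configured with PGO can be inspected from within Python. The sketch below looks for the flag in the recorded configure arguments; built_with_pgo is a hypothetical helper, and it only reflects what was passed to the configure script, so a negative result is not conclusive for interpreters that were built some other way.

```python
import sysconfig


def built_with_pgo(config_args=None):
    """Return True if the configure arguments include --enable-optimizations.

    If config_args is None the arguments recorded for the running
    interpreter are used. These may be missing on platforms where Python
    was not built via a configure script, in which case False is returned.
    """
    if config_args is None:
        config_args = sysconfig.get_config_var("CONFIG_ARGS") or ""
    return "--enable-optimizations" in config_args
```

Calling built_with_pgo() with no arguments checks the current interpreter. For the NumPy side, numpy.show_config() prints the BLAS/LAPACK libraries that NumPy was built against.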
If the point sampler plugin is being employed with a large number of sample points it is further recommended to install SciPy.