Performance Tuning
The following sections contain best practices for tuning the performance of PyFR. Note, however, that it is typically not worth pursuing the advice in this section until a simulation is working acceptably and generating the desired results.
OpenMP Backend
AVX-512
When running on an AVX-512 capable CPU Clang and GCC will, by default,
only make use of 256-bit vectors. Given that the kernels in PyFR
benefit meaningfully from longer vectors it is desirable to override
this behaviour. This can be accomplished through the cflags
key
as:
[backend-openmp]
cflags = -mprefer-vector-width=512
Cores vs. threads
PyFR does not typically derive any benefit from SMT. As such the number of OpenMP threads should be chosen to be equal to the number of physical cores.
Loop Scheduling
By default PyFR employs static scheduling for loops, with work being split evenly across cores. For systems with a single type of core this is usually the right choice. However, on heterogeneous systems it typically results in load imbalance. Here, it can be beneficial to experiment with the guided and dynamic loop schedules as:
[backend-openmp]
schedule = dynamic, 5
MPI processes vs. OpenMP threads
When using the OpenMP backend it is recommended to employ one MPI rank per NUMA zone. For most systems each socket represents its own NUMA zone. Thus, on a two socket system it is suggested to run PyFR with two MPI ranks, with each process being bound to a single socket. The specifics of how to accomplish this depend on both the job scheduler and MPI distribution.
Asynchronous MPI progression
The parallel scalability of the OpenMP backend depends heavily on MPI having support for asynchronous progression; that is to say the ability for non-blocking send and receive requests to complete without the need for the host application to make explicit calls into MPI routines. A lack of support for asynchronous progression prevents PyFR from being able to overlap computation with communication.
CUDA Backend
CUDA-aware MPI
PyFR is capable of taking advantage of CUDA-aware MPI. This enables
CUDA device pointers to be directly to passed MPI routines. Under the
right circumstances this can result in improved performance for
simulations which are near the strong scaling limit. Assuming
mpi4py has been built against an MPI distribution which is CUDA-aware
this functionality can be enabled through the mpi-type
key as:
[backend-cuda]
mpi-type = cuda-aware
HIP Backend
HIP-aware MPI
PyFR is capable of taking advantage of HIP-aware MPI. This enables
HIP device pointers to be directly to passed MPI routines. Under the
right circumstances this can result in improved performance for
simulations which are near the strong scaling limit. Assuming
mpi4py has been built against an MPI distribution which is HIP-aware
this functionality can be enabled through the mpi-type
key as:
[backend-hip]
mpi-type = hip-aware
Partitioning
METIS vs SCOTCH
The partitioning module in PyFR includes support for both METIS and SCOTCH. Both usually result in high-quality decompositions. However, for long running simulations on complex geometries it may be worth partitioning a grid with both and observing which decomposition performs best.
Mixed grids
When running PyFR in parallel on mixed element grids it is necessary to take some additional care when partitioning the grid. A good domain decomposition is one where each partition contains the same amount of computational work. For grids with a single element type the amount of computational work is very well approximated by the number of elements assigned to a partition. Thus the goal is simply to ensure that all of the partitions have roughly the same number of elements. However, when considering mixed grids this relationship begins to break down since the computational cost of one element type can be appreciably more than that of another.
There are two main solutions to this problem. The first is to require that each partition contain the same number of elements of each type. For example, if partitioning a mesh with 500 quadrilaterals and 1,500 triangles into two parts, then a sensible goal is to aim for each domain to have 250 quadrilaterals and 750 triangles. Irrespective of what the relative performance differential between the element types is, both partitions will have near identical amounts of work. In PyFR this is known as the balanced approach and can be requested via:
pyfr partition -e balanced ...
This approach typically works well when the number of partitions is small. However, for larger partition counts it can become difficult to achieve such a balance whilst simultaneously minimising communication volume. Thus, in order to obtain a good decomposition a secondary approach is required in which each type of element in the domain is assigned a weight. Element types which are more computationally intensive are assigned a larger weight than those that are less intensive. Through this mechanism the total cost of each partition can remain balanced. Unfortunately, the relative cost of different element types depends on a variety of factors, including:
The polynomial order.
If anti-aliasing is enabled in the simulation, and if so, to what extent.
The hardware which the simulation will be run on.
Weights can be specified when partitioning the mesh as
-e shape:weight
. For example, if on a particular system a
quadrilateral is found to be 50% more expensive than a triangle this
can be specified as:
pyfr partition -e quad:3 -e tri:2 ...
If precise profiling data is not available regarding the performance of each element type in a given configuration a helpful rule of thumb is to under-weight the dominant element type in the domain. For example, if a domain is 90% tetrahedra and 10% prisms then, absent any additional information about the relative performance of tetrahedra and prisms, a safe choice is to assume the prisms are appreciably more expensive than the tetrahedra.
Detecting load imbalances
PyFR includes code for monitoring the amount of time each rank spends waiting for MPI transfers to complete. This can be used, among other things, to detect load imbalances. Such imbalances are typically observed on mixed-element grids with an incorrect weighting factor. Wait time tracking can be enabled as:
[backend]
collect-wait-times = true
with the resulting statistics being recorded in the
[backend-wait-times]
section of the /stats
object which is
included in all PyFR solution files. This can be extracted as:
h5dump -d /stats -b --output=stats.ini soln.pyfrs
Note that the number of graphs depends on the system, and not all
graphs initiate MPI requests. The average amount of time each rank
spends waiting for MPI requests per right hand side evaluation can be
obtained by vertically summing all of the -median
fields together.
There exists an inverse relationship between the amount of computational work a rank has to perform and the amount of time it spends waiting for MPI requests to complete. Hence, ranks which spend comparatively less time waiting than their peers are likely to be overloaded, whereas those which spend comparatively more time waiting are likely to be underloaded. This information can then be used to explicitly re-weight the partitions and/or the per-element weights.
Scaling
The general recommendation when running PyFR in parallel is to aim for a parallel efficiency of \(\epsilon \simeq 0.8\) with the parallel efficiency being defined as:
where \(N\) is the number of ranks, \(T_1\) is the simulation time with one rank, and \(T_N\) is the simulation time with \(N\) ranks. This represents a reasonable trade-off between the overall time-to-solution and efficient resource utilisation.
Parallel I/O
PyFR incorporates support for parallel file I/O via HDF5 and will use it automatically where available. However, for this work several prerequisites must be satisfied:
HDF5 must be explicitly compiled with support for parallel I/O.
The mpi4py Python module must be compiled against the same MPI distribution as HDF5. A version mismatch here can result in subtle and difficult to diagnose errors.
The h5py Python module must be built with support for parallel I/O.
After completing this process it is highly recommended to verify everything is working by trying the h5py parallel HDF5 example.
Plugins
A common source of performance issues is running plugins too
frequently. PyFR records the amount of time spent in plugins in the
[solver-time-integrator]
section of the /stats
object which is
included in all PyFR solution files. This can be extracted as:
h5dump -d /stats -b --output=stats.ini soln.pyfrs
Given the time steps taken by PyFR are typically much smaller than those associated with the underlying physics there is seldom any benefit to running integration and/or time average accumulation plugins more frequently than once every 50 steps. Further, when running with adaptive time stepping there is no need to run the NaN check plugin. For simulations with fixed time steps, it is not recommended to run the NaN check plugin more frequently than once every 10 steps.
Start-up Time
The start-up time required by PyFR can be reduced by ensuring that
Python is compiled from source with profile guided optimisations (PGO)
which can be enabled by passing --enable-optimizations --with-lto
to the configure
script.
It is also important that NumPy be configured to use an optimised BLAS/LAPACK distribution. Further details can be found in the NumPy building from source guide.