# Parthenon developer guide

## (Kokkos) programming model
The following list contains a few overall design decisions that are useful to keep in mind:

- Kokkos never hides copies, i.e., there are no hidden deep copies and all data transfers need to be explicit.
- Kernel launches (e.g., a parallel region) are non-blocking, i.e., the CPU continues to execute code following a kernel that is executed on a GPU.
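Both points can be seen in a minimal plain-Kokkos sketch (hypothetical names): the host-device transfer only happens because of the explicit `Kokkos::deep_copy`, and the `parallel_for` may still be running when control returns to the host.

```c++
#include <Kokkos_Core.hpp>

void example(const int n) {
  Kokkos::View<double *> dev("dev", n);
  auto host = Kokkos::create_mirror_view(dev); // host mirror, no copy yet

  // Non-blocking: control may return here before the kernel has finished.
  Kokkos::parallel_for(
      "fill", n, KOKKOS_LAMBDA(const int i) { dev(i) = 2.0 * i; });

  // Explicit transfer; deep_copy without an execution space argument also
  // fences, so the kernel above is guaranteed to have completed.
  Kokkos::deep_copy(host, dev);
}
```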
## Kokkos wrappers/abstractions
- `par_for` wrappers use inclusive bounds, i.e., the loop will include the last index given.
- `ParArrayND` arrays by default allocate on the device using the default precision configured and come with a `State` that can be used to store additional metadata.
- `ParArray#DRaw` arrays directly map to Kokkos `View`s that are allocated on the device using the default precision.
- To create an array on the host with a layout identical to the device array, either use
  `auto arr_host = Kokkos::create_mirror(arr_dev);`
  to always create a new array, even if the device is associated with the host (e.g., OpenMP), or
  `auto arr_host = Kokkos::create_mirror_view(arr_dev);`
  to create an array on the host if `HostSpace != DeviceSpace`, or to get another reference to `arr_dev` through `arr_host` if `HostSpace == DeviceSpace`.
- `par_for` and `Kokkos::deep_copy` by default use the standard stream (on Cuda devices), and their direct use is discouraged. Use `mb->par_for` and `mb->deep_copy` instead, where `mb` is a `MeshBlock` (explanation: each `MeshBlock` has an `ExecutionSpace`, which may be changed at runtime, e.g., to a different stream, and the wrappers within a `MeshBlock` offer transparent access to the parallel region/copy, where the `MeshBlock`'s `ExecutionSpace` is automatically used); see the first sketch after this list.
- The default loop pattern for the `mb->par_for` wrappers is defined at compile time through the `PAR_LOOP_LAYOUT` and `PAR_LOOP_INNER_LAYOUT` CMake variables. Note that the 1D and 2D abstractions currently only wrap `Kokkos::RangePolicy` and `Kokkos::MDRangePolicy`, respectively, and are thus independent of the `PAR_LOOP_LAYOUT` and `PAR_LOOP_INNER_LAYOUT` configuration.
- `DeviceAllocate` and `DeviceCopy` return a `unique_ptr` to an object allocated in device memory; the latter also copies data from a provided object in host memory. These `unique_ptr`s automatically free the device memory when they go out of scope. This is especially useful for inheritance, but note that these allocation and deallocation calls are not performant and should be used sparingly.
- An arbitrary-dimensional wrapper for Kokkos `View`s is available as `ParArrayND`. See the documentation here.
- The wrappers `par_for_outer` and `par_for_inner` provide a nested parallelism interface that is needed for managing memory cached in tightly nested loops; see the second sketch after this list. The wrappers are documented here.
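For the `mb->par_for` wrappers above, a minimal sketch (assuming `pmb` is a `MeshBlock` pointer and `v`, `kb`, `jb`, and `ib` are a device array and index ranges as in the reduction example further below):

```c++
// Runs on pmb's ExecutionSpace (and thus its stream) instead of the default
// stream; note the inclusive bounds.
pmb->par_for(
    "initialize v", kb.s, kb.e, jb.s, jb.e, ib.s, ib.e,
    KOKKOS_LAMBDA(const int k, const int j, const int i) { v(k, j, i) = 0.0; });
```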
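Similarly, a hedged sketch of the nested `par_for_outer`/`par_for_inner` pattern under the same assumptions (`DEFAULT_OUTER_LOOP_PATTERN`, `DevExecSpace`, and `team_mbr_t` come with the Parthenon wrappers; no scratch memory is requested here):

```c++
parthenon::par_for_outer(
    DEFAULT_OUTER_LOOP_PATTERN, "nested example", DevExecSpace(),
    0 /* scratch bytes */, 0 /* scratch level */, kb.s, kb.e, jb.s, jb.e,
    KOKKOS_LAMBDA(parthenon::team_mbr_t member, const int k, const int j) {
      // The inner loop is distributed over the threads of the team.
      parthenon::par_for_inner(member, ib.s, ib.e,
                               [&](const int i) { v(k, j, i) = 0.0; });
    });
```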
## View of Views
Special care needs to be taken when working with a `View` of `View`s.
To repeat the Kokkos documentation: don't use them.
But if you have to (which is the case in some places inside Parthenon), then follow this pattern:
```c++
ParArray1DRaw<ParArray1D<Real>> view_of_pararrays(parthenon::ViewOfViewAlloc("myname"), 10);
```
The `ViewOfViewAlloc` ensures that the `Kokkos::SequentialHostInit` property is added, which results in the (inner `View`) deallocators being called on the host (rather than on the device, which is the default).
Also note the use of the "raw" `ParArray1DRaw`, which directly maps to a Kokkos `View` (this is required to process the allocation property, as this interface is not exposed in the more generic `ParArrayND`).
Similarly, when you create a host mirror of said `View` of `View`s, add the additional property for the same reason:
```c++
// explicit theoretical example -- don't use this
auto view_of_pararrays_h =
    Kokkos::create_mirror_view(Kokkos::view_alloc(Kokkos::SequentialHostInit), view_of_pararrays);
// but instead use this interface provided by Parthenon:
auto view_of_pararrays_h = create_view_of_view_mirror(view_of_pararrays);
```
Note that `SequentialHostInit` was only added in Kokkos 4.4.1 (which is now the default in Parthenon).
## The need for reductions within functions handling `MeshBlock` data
A task (often a function associated with a `MeshBlock`) in the original Athena++ code was guaranteed to only be executed by a single CPU thread at any time. Thus, a reduction within a single `MeshBlock` (e.g., finding a minimum value in the `MeshBlock` data, or filling a buffer using an incremented offset pointer) did not require special care. This is no longer true in Parthenon, as a parallel region may be handled by many (GPU/...) threads in parallel. Therefore, a `parallel_reduce` needs to be used instead of a `parallel_for`.

A strong hint that this is in order are places where a single variable is incremented within a kernel, or where another reduction over MPI processes follows the preceding computations.
Parthenon provides abstractions for performing reductions. In particular, reductions are available for loops from 1D to 5D, but do not include any nested parallel patterns. The interfaces for the abstracted `parallel_for` and `parallel_reduce` executions are nearly identical, with calls to `par_reduce` taking one additional final argument compared to `par_for`. The additional argument should be a Kokkos Reducer, for example one of the built-in Reducers that ship with Kokkos.

Examples can be found in the advection example:
```diff
- Real vmin = 1.0;
- Real vmax = 0.0;
- for (int k = kb.s; k <= kb.e; k++) {
-   for (int j = jb.s; j <= jb.e; j++) {
-     for (int i = ib.s; i <= ib.e; i++) {
-       vmin = (v(k, j, i) < vmin ? v(k, j, i) : vmin);
-       vmax = (v(k, j, i) > vmax ? v(k, j, i) : vmax);
-     }
-   }
- }
+ typename Kokkos::MinMax<Real>::value_type minmax;
+ pmb->par_reduce(
+     "advection check refinement", kb.s, kb.e, jb.s, jb.e, ib.s, ib.e,
+     KOKKOS_LAMBDA(int k, int j, int i,
+                   typename Kokkos::MinMax<Real>::value_type &lminmax) {
+       lminmax.min_val = (v(k, j, i) < lminmax.min_val ? v(k, j, i) : lminmax.min_val);
+       lminmax.max_val = (v(k, j, i) > lminmax.max_val ? v(k, j, i) : lminmax.max_val);
+     },
+     Kokkos::MinMax<Real>(minmax));
```
or in the buffer packing functions:
```diff
-void PackData(ParArrayND<T> &src, T *buf, int sn, int en, int si, int ei, int sj, int ej,
-              int sk, int ek, int &offset) {
-  for (int n = sn; n <= en; ++n) {
-    for (int k = sk; k <= ek; k++) {
-      for (int j = sj; j <= ej; j++) {
-#pragma omp simd
-        for (int i = si; i <= ei; i++)
-          buf[offset++] = src(n, k, j, i);
-      }
-    }
-  }
-}
+void PackData(ParArray4D<T> &src, ParArray1D<T> &buf, int sn, int en, int si, int ei,
+              int sj, int ej, int sk, int ek, int &offset, MeshBlock *pmb) {
+  int ni = ei + 1 - si;
+  int nj = ej + 1 - sj;
+  int nk = ek + 1 - sk;
+  int nn = en + 1 - sn;
+
+  pmb->par_for(
+      "PackData 4D", sn, en, sk, ek, sj, ej, si, ei,
+      KOKKOS_LAMBDA(const int n, const int k, const int j, const int i) {
+        buf(offset + i - si + ni * (j - sj + nj * (k - sk + nk * (n - sn)))) =
+            src(n, k, j, i);
+      });
+  offset += nn * nk * nj * ni;
+  return;
+}
```
Note the explicit calculation of the offset within the kernel and the explicit increment of the offset by the full extent after the kernel.
## FAQ
### What’s the difference between `GetDim` and `extent`?
`ParArrayND`s offer `GetDim` to access the underlying array dimensions. Here, `GetDim(1)` refers to the "first" dimension (e.g., the x-direction). `ParArray#D`s (with `#` being 1, 2, 3, ...) are direct typedefs to `Kokkos::View`s. Thus, a call to `extent(0)` returns the extent along the first index. Given that `ParArray#D`s are constructed using reversed indices (note the `k,j,i` order in accessing elements), `extent` and `GetDim` called with the same number usually have different meanings.
```c++
auto myarr_nd = ParArrayND<Real>("myarr", nx4, nx3, nx2, nx1); // logically a 6D array under the hood
ParArray4D<Real> myarr_fd = myarr_nd.Get<4>(); // extracts a 4D View with fixed dimensions
myarr_nd.GetDim(4); // = nx4
myarr_nd.GetDim(1); // = nx1
myarr_fd.extent(0); // = nx4
myarr_fd.extent(3); // = nx1
```
### Where to allocate scratch pad memory (e.g., for temporary arrays that are shared between multiple function calls within a nested parallel region)?
Scratch pad memory is unique to each team and will be reused from a larger pool of memory available to all teams. However, this allocation tracking only works if the `ScratchPadView`s are constructed within the outer parallel region. Therefore, allocating/constructing `ScratchPadView`s within functions that are called in the outer parallel region will lead to an overallocation of memory (and will likely result in a segfault or out-of-memory exceptions).
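A sketch of this rule, reusing the nested wrappers from above (`nx1`, the index ranges, and `v` are assumed as before):

```c++
const int scr_level = 0;
const std::size_t scr_size = parthenon::ScratchPad1D<Real>::shmem_size(nx1);

parthenon::par_for_outer(
    DEFAULT_OUTER_LOOP_PATTERN, "scratch placement", DevExecSpace(),
    scr_size, scr_level, kb.s, kb.e, jb.s, jb.e,
    KOKKOS_LAMBDA(parthenon::team_mbr_t member, const int k, const int j) {
      // Correct: constructed directly in the outer parallel region, so the
      // allocation is tracked and the pool is reused between teams.
      parthenon::ScratchPad1D<Real> tmp(member.team_scratch(scr_level), nx1);
      parthenon::par_for_inner(member, ib.s, ib.e,
                               [&](const int i) { tmp(i) = v(k, j, i); });
      // Constructing ScratchPadViews in a function called from here would
      // instead over-allocate memory.
    });
```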
### Where to use barriers/fences?
As mentioned above, kernel launches are non-blocking and kernel executions are asynchronous (potentially handled by the execution space scheduler). Thus, barriers are required wherever the following code depends on the successful execution of all kernels scheduled so far. There are three obvious places where this applies (see the sketch after this list for the first one):

1. Around MPI calls, e.g., sending a buffer should only be done once the kernel filling the buffer has finished. In order for the parallel execution to continue (e.g., multiple `MeshBlock`s in multiple device streams), the `fence` function of the corresponding execution space needs to be used, i.e., `pmb->exec_space.fence();`, and not the global fence (`Kokkos::fence();`).
2. Within nested parallel regions when using scratch space. The threads within a team are independent, and thus a `member.team_barrier()` is required between filling the scratch space and (re)using it.
3. When collecting the result of a parallel reduction in a `View`. Usually, `parallel_reduce` regions are blocking if the result of the reduction is a host variable (more precisely, of scalar type), e.g., a simple `double` (or, here, a `Real`). If the result of the reduction is a `View`, then the region is non-blocking, and other places in the code should ensure that all reductions have finished (e.g., when calculating the minimum timestep over all `MeshBlock`s of a single process). This also applies to hierarchical parallelism, i.e., when an inner `parallel_reduce` reduces to a `ScratchPadView`, a `team_barrier()` is required.
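A sketch of the first case (assuming 1D device arrays `buf` and `v`; `MPI_PARTHENON_REAL` is Parthenon's MPI datatype for `Real`, and `dest`, `tag`, and `req` are placeholders for illustration):

```c++
pmb->par_for(
    "fill buffer", 0, n - 1, KOKKOS_LAMBDA(const int i) { buf(i) = v(i); });
pmb->exec_space.fence(); // only fences this block's stream; buf is now valid
MPI_Isend(buf.data(), n, MPI_PARTHENON_REAL, dest, tag, MPI_COMM_WORLD, &req);
```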
### Why do I need to redefine variables preceding a parallel region?
The `KOKKOS_LAMBDA` macro expands into a capture by value (`[=]`) plus host/device annotations. Thus, class member variables are not captured directly; rather, `this` is, see also a related issue on GitHub. A redefinition, e.g., `auto coarse_coords = this->coarse_coords;`, ensures that the desired object is properly captured and available within the kernel(/parallel region), as sketched below.
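A minimal sketch with a hypothetical class: the member variable is reached through a local copy rather than through the captured `this` pointer (which would be invalid on the device):

```c++
struct MyTask {
  parthenon::Real scale_;

  void Apply(parthenon::MeshBlock *pmb,
             parthenon::ParArray1D<parthenon::Real> v, const int n) {
    auto scale = scale_; // local redefinition, captured by value below
    pmb->par_for(
        "scale v", 0, n - 1,
        KOKKOS_LAMBDA(const int i) { v(i) *= scale; }); // not this->scale_
  }
};
```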
### What does `error: The enclosing parent function ("...") for an extended __host__ __device__ lambda cannot have private or protected access within its class` mean?
This is a current Cuda limitation for extended device lambdas, see the Cuda programming guide, and can be “fixed”/addressed by making the function public:
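A sketch of the fix (`Launch` must be public because it contains the extended lambda):

```c++
class MyDriver {
 public: // private or protected here triggers the error above under Cuda
  void Launch(parthenon::MeshBlock *pmb, const int n) {
    pmb->par_for(
        "noop", 0, n - 1, KOKKOS_LAMBDA(const int i) { /* device code */ });
  }
};
```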