Parthenon developer guide
(Kokkos) programming model
The following list contains a few overall design decision that are useful to keep in mind
- Kokkos never hides copies, i.e., there are no hidden deep copies and all data transfer needs to be explicit. 
- Kernel launches (e.g., a parallel region) are non-blocking, i.e., the CPU continues to execute code following a kernel that is executed on a GPU. 
Kokkos wrappers/abstractions
- par_forwrappers use inclusive bounds, i.e., the loop will include the last index given
- ParArrayNDarrays by default allocate on the device using default precision configured and come with a State that can be used to store additional metadata.
- ParArray#DRawdirectly map to Kokkos- Viewsthat are allocated
- on device using default precision. 
 
- To create an array on the host with identical layout to the device array either use - auto arr_host = Kokkos::create_mirror(arr_dev);to always create a new array even if the device is associated with the host (e.g., OpenMP) or
- auto arr_host = Kokkos::create_mirror_view(arr_dev);to create an array on the host if the HostSpace != DeviceSpace or get another reference to arr_dev through arr_host if HostSpace == DeviceSpace
 
- par_forand- Kokkos::deep_copyby default use the standard stream (on Cuda devices) and are discouraged from use. Use- mb->par_forand- mb->deep_copyinstead where- mbis a- MeshBlock(explanation: each- MeshBlockhas an- ExecutionSpace, which may be changed at runtime, e.g., to a different stream, and the wrapper within a- MeshBlockoffer transparent access to the parallel region/copy where the- MeshBlock’s- ExecutionSpaceis automatically used).
- The default loop pattern for the - mb->par_forwrappers is defined at compile time through the- PAR_LOOP_LAYOUTand- PAR_LOOP_INNER_LAYOUT- CMakevariables Note, that the 1D and 2D abstractions currently only wrap- Kokkos::RangePolicyand- Kokkos::MDRangePolicy, respectively, and, thus, are indepdent of the- PAR_LOOP_LAYOUTand- PAR_LOOP_INNER_LAYOUTconfiguration.
- DeviceAllocateand- DeviceCopyreturn a- unique_ptrto an object allocated on device memory; the latter also copies data from a provided object in host memory. These- unique_ptrs automatically free the device memory when they go out of scope. This is especially useful for inheritance, but note that these allocation and deallocation calls are not performant and should be used minimally.
An arbitrary-dimensional wrapper for Kokkos::Views is available as
ParArrayND. See documentation here.
The wrappers par_for_outer and par_for_inner provide a nested
parallelism interface that is needed for managing memory cached in
tightly nested loops. The wrappers are documented
here.
View of Views
Special care needs to be taken when working with a View of Views.
To repeat the Kokkos documenation: Don’t use them
But if you have to (which is the case in some places inside Parthenon) then follow this pattern:
ParArray1DRaw<ParArray1D<Real>> view_of_pararrays(parthenon::ViewOfViewAlloc("myname"), 10);
The ViewOfViewAlloc ensures that the Kokkos::SequentialHostInit property is added,
which results in the (inner View ) deallocators being called on the host (rather than on
the device by default).
Also note the use of the “raw” ParArray1DRaw, which directly maps to a Kokkos View
(that is required to process the allocation property as this interface is not exposed
in the more generic ParArrayND).
Similarly, when you create a host mirror of said View of View add the additional
property for the same reason.
// explicit theoretical example -- don't use this
auto view_of_pararrays_h =
     Kokkos::create_mirror_view(Kokkos::view_alloc(Kokkos::SequentialHostInit), view_of_pararrays);
// but instead use this interface provided by Parthenon:
auto view_of_pararrays_h = create_view_of_view_mirror(view_of_pararrays);
Note that the SequentialHostInit was only added in Kokkos 4.4.1 (which is now the default in Parthenon).
The need for reductions within function handling MeshBlock data
A task (often a function associated with a MeshBlock) in the
original Athena++ code was guaranteed to only being executed using a
single CPU thread at any time. Thus, a reduction within a single
MeshBlock(e.g., finding a minimum value in the MeshBlock data
or filling a buffer using an incremented offset pointer) did not
require special care. This is not true in Parthenon any more as a
parallel region may be handled by many (GPU/…) threads in parallel.
Therefore, a parallel_reduce needs to be used instead of a
parallel_for.
A strong hint where this is in order are places where a single variable is incremented within a kernel or where another reduction over MPI processes follows the preceding computations.
Parthenon provides abstractions that provide the ability to perform
reductions. In particular, reductions are available for loops from 1D to
5D, but do not include any nested parallel patterns. The interface for
the abstracted parallel_for and parallel_reduce executions are
nearly identical, with calls to par_reduce having one additional
final argument compared to par_for. The additional argument should
be a Kokkos Reducer, for example one of the built-in
Reducers
that ship with Kokkos.
Examples can be found in the advection example
-  Real vmin = 1.0;
-  Real vmax = 0.0;
-  for (int k = kb.s; k <= kb.e; k++) {
-    for (int j = jb.s; j <= jb.e; j++) {
-      for (int i = ib.s; i <= ib.e; i++) {
-        vmin = (v(k, j, i) < vmin ? v(k, j, i) : vmin);
-        vmax = (v(k, j, i) > vmax ? v(k, j, i) : vmax);
-      }
-    }
-  }
+ typename Kokkos::MinMax<Real>::value_type minmax;
+ pmb->par_reduce(
+     "advection check refinement", kb.s, kb.e, jb.s, jb.e, ib.s, ib.e,
+     KOKKOS_LAMBDA(int k, int j, int i,
+                   typename Kokkos::MinMax<Real>::value_type &lminmax) {
+       lminmax.min_val = (v(k, j, i) < lminmax.min_val ? v(k, j, i) : lminmax.min_val);
+       lminmax.max_val = (v(k, j, i) > lminmax.max_val ? v(k, j, i) : lminmax.max_val);
+     },
+     Kokkos::MinMax<Real>(minmax));
or in the buffer packing functions
-void PackData(ParArrayND<T> &src, T *buf, int sn, int en, int si, int ei, int sj, int ej,
-              int sk, int ek, int &offset) {
-  for (int n = sn; n <= en; ++n) {
-    for (int k = sk; k <= ek; k++) {
-      for (int j = sj; j <= ej; j++) {
-#pragma omp simd
-        for (int i = si; i <= ei; i++)
-          buf[offset++] = src(n, k, j, i);
-      }
-    }
-  }
+void PackData(ParArray4D<T> &src, ParArray1D<T> &buf, int sn, int en, int si, int ei,
+              int sj, int ej, int sk, int ek, int &offset, MeshBlock *pmb) {
+  int ni = ei + 1 - si;
+  int nj = ej + 1 - sj;
+  int nk = ek + 1 - sk;
+  int nn = en + 1 - sn;
+
+  pmb->par_for(
+      "PackData 4D", sn, en, sk, ek, sj, ej, si, ei,
+      KOKKOS_LAMBDA(const int n, const int k, const int j, const int i) {
+        buf(offset + i - si + ni * (j - sj + nj * (k - sk + nk * (n - sn)))) =
+            src(n, k, j, i);
+      });
+  offset += nn * nk * nj * ni;
+  return;
Note the explicit calculation of the offset within the kernel and the explicit increment of the offset by the full extent after the kernel.
FAQ
- What’s the difference between - GetDimand- extent?
ParArrayND offer GetDim to access the underlying array
dimension. Here, GetDim(0) refers to the “first” dimension (e.g.,
x-direction). ParArray#Ds (with # being 1, 2, 3, …) are direct
typedefs to Kokkos::Views. Thus, a call to extent(0) returns
the dimension along the first index. Given that ParArray#Ds are
constructed using reverse indices (note the k,j,i order in accessing
elements), extent and GetDim using the same number usually have
different meaning.
auto myarr_nd = ParArrayND<Real>("myarr",nx4,nx3,nx2,nx1); // is logically a 6D array under the hood
ParArray4D<Real> myarr_fd = myarr_nd.Get<4>(); // extracts a 4D View with fixed dimensions
myarr_nd.GetDim(4); // = nx4
myarr_nd.GetDim(1); // = nx1
myarr_fd.extent(0); // = nx4
myarr_fd.extent(3); // = nx1
- Where to allocate scratch pad memory (e.g., for temporary arrays that are shared between multiple function calls within a nested parallel region)? 
Scratch pad memory is unique to each team can will be reused from a
larger pool of memory available for all teams. However, this allocation
tracking only works if the ScratchPadViews are constructed within
the outer parallel regions. Therefore, allocating/constructing
ScratchPadViews within functions that are called in the outer
parallel region will lead to an overallocation of memory (and likely
result in a segfault or out of memory exceptions).
- Where to use barriers/fences? 
As mentioned above, kernel launches are non-blocking and kernel
executions are asynchronous (potentially handles by the execution space
scheduler). Thus, barriers are required where the following code
requires the successful execution of all kernels scheduled. There are
three obvious places where this applies: 1. Around MPI calls, e.g.,
sending a buffer should first be done when the kernel filling the buffer
has finished. In order for the parallel execution to continue (e.g.,
multiple MeshBlocks in multiple device streams) the fence
function of the corresponding execution space needs to be used, i.e.,
pmb->exec_space.fence(); and not the global fence
(Kokkos::fence();). 2. Within a nested parallel regions when using
scratch space. The threads within a team are independent and thus a
member.team_barrier() is required between filling the scratch space
and (re)using it. 3. When collecting the results of a parallel reduction
on a View. Usually parallel_reduce regions are blocking if the
result of the reduction is a host variable (more precisely, of scalar
type), e.g., a simple double (or here a Real). If the result of
the reduction is a View then the region is non-blocking and other
places in the code should ensure that all reductions are finished (e.g.,
calculating the minimum timestep over all MeshBlocks of a single
process. This also applies to hierarchical parallelism, i.e., when an
inner parallel_reduce reduces to a ScratchPadView then a
team_barrier() is required.
- Why do I need to redefine variables preceding a parallel region? 
The KOKKOS_LAMBDA macro expands into a capture by value [=]
(plus host/device annotations). Thus, class member variables are not
captured directly, but rather this is, see also a related
issue on GitHub. A
redefinition, e.g., auto coarse_coords = this->coarse_coords;
ensures that the desired object is properly captured and available
within the kernel(/parallel region).
- What does - "error: The enclosing parent function ("...") for an extended __host__ __device__ lambda cannot have private or protected access within its class"mean?
This is a current Cuda limitation for extended device lambdas, see Cuda programming guide, and can be “fixed”/addressed by making the function public.