.. _nested par for:

Nested Parallelism
==================

The loop wrappers documented here abstract a hierarchical parallelism
model, which allows finer-grained control of core-level versus thread-
and vector-level parallelism and allows explicitly defined caches for
tightly nested loops. These wrappers provide a simplified interface to
the `hierarchical parallelism in Kokkos `__.

For an example of the nested parallel wrappers in use, see `the unit
test `__.

``par_for_outer``
-----------------

``par_for_outer`` abstracts the team or multicore parallelism of outer
loops. Inside the loop body, i.e., in the lambda function provided to
``par_for_outer``, synchronization and memory sharing between the
threads of a single team is possible through the ``member_type`` team
member type from Kokkos.

The bytes of scratch memory (cache) needed by a single team are
specified via ``scratch_size_in_bytes``, which needs to be computed
using ``ScratchPadXD::shmem_size``. The argument ``scratch_level``
defines where the scratch memory is allocated. For CUDA GPUs,
``scratch_level=0`` allocates the cache in the faster but smaller
``shared`` memory, and ``scratch_level=1`` allocates the cache in the
slower but larger ``global`` memory (on-device RAM). For CPUs,
``scratch_level`` currently makes no difference.

Note that every thread within a team will execute code that is inside a
``par_for_outer`` but outside of a ``par_for_inner``.

``par_for_inner``
-----------------

``par_for_inner`` abstracts the thread- and vector-level parallelism of
compute units within a single team or core. Work defined through a
``par_for_inner`` will be distributed between individual threads and
vector lanes within the team.

``ScratchPadXD``
----------------

Data type for memory in scratch pad/cache memory. Use
``ScratchPadXD::shmem_size``, which is documented in `the Kokkos
documentation `__, to determine scratch pad memory needs before kernel
launch.

On Barriers
-----------

To ensure that the individual threads of a team are synchronized,
always call ``team_member.team_barrier();`` after a ``par_for_inner``
if the following execution depends on the results of that
``par_for_inner``. This pertains, for example, to filling a
``ScratchPadXD`` array in one ``par_for_inner`` and using the scratch
array in the next one; see `the unit test `__ for sample usage.

In addition, entry to a ``par_for_inner`` does **not** imply a barrier,
and not all threads of a team may even enter an inner parallel region
(e.g., if there is not enough work, i.e., not enough indices, for all
team members). This can lead to unintended side effects when all team
members write to common variables; see this `code `__ for an example.

Cmake Options
-------------

``PAR_LOOP_INNER_LAYOUT`` controls how the inner loop is implemented.

``PAR_LOOP_INNER_LAYOUT=TVR_INNER_LOOP`` uses the Kokkos
``TeamVectorRange``, which merges ``TeamThreadRange`` and
``ThreadVectorRange`` into one loop, to distribute work between
threads. ``PAR_LOOP_INNER_LAYOUT=TVR_INNER_LOOP`` is the only option
supported for CUDA, since the Kokkos loops are required for
parallelization on GPUs.

``PAR_LOOP_INNER_LAYOUT=SIMDFOR_INNER_LOOP`` uses a ``for`` loop with a
``#pragma omp simd`` to vectorize the loop, which typically gives
better vectorization than ``PAR_LOOP_INNER_LAYOUT=TVR_INNER_LOOP`` on
CPUs and is therefore the default on CPUs.
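Putting the wrappers, scratch memory, and barriers described above
together, a minimal sketch of an outer team loop with two synchronized
inner loops might look like the following. This is not the unit test
itself; the sizes ``nouter`` and ``nx`` and the output view ``out`` are
hypothetical and chosen only for illustration.

.. code:: cpp

   using namespace parthenon::package::prelude;
   using parthenon::par_for_inner;
   using parthenon::par_for_outer;
   using parthenon::ScratchPad1D;
   using parthenon::team_mbr_t;

   const int nouter = 16, nx = 64;               // hypothetical sizes
   Kokkos::View<Real **> out("out", nouter, nx); // hypothetical output

   // Scratch for one 1D pad of nx Reals per team
   const auto scratch_size_in_bytes = ScratchPad1D<Real>::shmem_size(nx);
   constexpr int scratch_level = 0; // fast scratch memory on GPUs

   par_for_outer(
       DEFAULT_OUTER_LOOP_PATTERN, "NestedParExample", DevExecSpace(),
       scratch_size_in_bytes, scratch_level, 0, nouter - 1,
       KOKKOS_LAMBDA(team_mbr_t member, const int n) {
         ScratchPad1D<Real> cache(member.team_scratch(scratch_level), nx);

         // First inner loop: fill the scratch pad cooperatively.
         par_for_inner(DEFAULT_INNER_LOOP_PATTERN, member, 0, nx - 1,
                       [&](const int i) { cache(i) = static_cast<Real>(n + i); });

         // The next inner loop reads what the previous one wrote, so
         // the team must synchronize here.
         member.team_barrier();

         // Second inner loop: consume the scratch pad.
         par_for_inner(DEFAULT_INNER_LOOP_PATTERN, member, 0, nx - 1,
                       [&](const int i) { out(n, i) = 2.0 * cache(i); });
       });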
Performance Considerations
--------------------------

Hierarchical parallelism can produce very performant code, but a deeper
awareness of how hardware is mapped to threads is required to get
optimal performance. Here we list a few strategies/considerations.

* On CPUs, with ``SIMDFOR_INNER_LOOP`` you may have trouble vectorizing
  unless you help the compiler along. One way to do so is to work with
  raw pointers to contiguous memory, rather than working with views and
  strides. Even for stencil ops, if you can pull out pointers that
  represent the different points on the stencil, this can help with
  vectorization.
* Similarly on CPUs, due to the cost of starting up a vector op,
  vectorization will only be a performance win if there is enough work
  in the inner loop. A minimum of 16 points is required for the op to
  vectorize at all. Experience shows, however, that at least 64 points
  are really required to see big wins. One strategy for providing
  enough cells in the inner loop is to use a 1D ``SIMDFOR`` inner loop
  but combine the ``j`` and ``i`` indices by simply looping over the
  contiguous memory of a rasterized plane on a block.
* On GPUs, the outer loop typically maps to blocks, while the inner
  loop maps to threads. To see good performance, you must provide both
  enough work in the inner loop to fill a streaming multiprocessor (SM,
  in CUDA terms; the equivalent of a Compute Unit or CU on AMD GPUs)
  with multiple warps (wavefronts on AMD) to take advantage of
  pipelining, and enough work in the outer loop to create enough blocks
  to fill all SMs on the GPU, divided by the number of simultaneous
  streams. The number of warps in flight on the inner loop per SM
  (which is related to "occupancy") depends positively on the length of
  the inner loop and negatively on shared memory usage (scratch pad
  memory in Kokkos parlance, Local Data Share or LDS on AMD GPUs) and
  register usage. Note that the number of SMs and the available shared
  memory and registers per SM vary between GPU architectures and
  especially between GPU vendors.

IndexSplit
----------

To balance the CPU vs. GPU hardware considerations of hierarchical
parallelism, ``Parthenon`` provides a utility, the ``IndexSplit``
class, defined in the ``utils/index_split.hpp`` header file and
available through the ``parthenon::package::prelude`` namespace.

In our experience, ``IndexSplit`` is most beneficial when working with
small meshblocks on CPUs, especially in two dimensions. For small
blocks, we want vectorized operations over contiguous memory in our
innermost loop, but we want that loop to contain enough work for, e.g.,
vector ops to pay off. We have often found that the optimal split is to
fuse ``j`` and ``i`` into the inner loop and use ``k`` and blocks in
the outer loop.

The ``IndexSplit`` class can be constructed as

.. code:: cpp

   IndexSplit(MeshData<Real> *md, IndexDomain domain, const int nkp,
              const int njp);

where ``md`` is the ``MeshData`` object on which you want to operate,
``domain`` specifies the part of the ``MeshBlock`` you wish to operate
on, for example ``IndexDomain::interior``, and ``nkp`` and ``njp`` are
the number of points in ``X3`` and ``X2``, respectively, that are in
the outer loop. All remaining points are in the inner loop; each team
will iterate over multiple ``k`` and/or ``j`` indices to cover the
specified ``k``/``j`` range. The ``MeshBlock`` index in the pack is
also typically assumed to be in the outer loop.
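For instance (the split values here are hypothetical and the best
choice is problem dependent), constructing a split that chunks ``X3``
across the outer loop while flattening ``j`` and ``i`` into the inner
loop might look like:

.. code:: cpp

   // Hypothetical split (values are illustrative only): distribute X3
   // over four outer chunks and keep X2 in one chunk per outer
   // iteration, so j and i are flattened into the contiguous inner range.
   IndexSplit idx_sp(md, IndexDomain::interior, 4, 1);

   // The outer loop then runs over idx_sp.outer_size() entries (in
   // addition to the block index); inside the outer loop body each
   // entry is mapped back to k/j chunks with GetBoundsK/GetBoundsJ and
   // to a flattened j/i inner range with GetInnerBounds, as in the
   // full example below.
   const int n_outer = idx_sp.outer_size();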
``nkp`` and ``njp`` also accept the special flags
``IndexSplit::all_outer`` and ``IndexSplit::no_outer``, which specify
that all or none of the indices in that direction should be in the
outer loop, respectively.

.. warning::

   Note that, in contrast to ``njp``, ``nkp`` points in the
   ``k``-direction are not included in the innermost loop bounds. You
   must loop over ``k`` by hand inside the outer loop body.

A second constructor alternatively sets the ranges for ``X3``, ``X2``,
and ``X1`` explicitly:

.. code:: cpp

   IndexSplit(MeshData<Real> *md, const IndexRange &kb, const IndexRange &jb,
              const IndexRange &ib, const int nkp, const int njp);

where ``kb``, ``jb``, and ``ib`` specify the starting and ending
indices for ``X3``, ``X2``, and ``X1``, respectively.

.. warning::

   Note that, at this time, ``IndexSplit`` doesn't know about
   face-centered or edge-centered data. To use ``IndexSplit`` with,
   e.g., face-centered data, set the input ``IndexRange`` quantities to
   match the shape of the face-centered data (e.g., with the
   appropriate offsets).

An ``IndexSplit`` object is typically used as:

.. code:: cpp

   using namespace parthenon::package::prelude;
   using parthenon::ScratchPad1D;
   using parthenon::IndexSplit;
   using parthenon::par_for_outer;
   using parthenon::par_for_inner;
   using parthenon::team_mbr_t;

   // Initialize the index split object
   IndexSplit idx_sp(md, IndexDomain::interior, nkp, njp);

   // Request maximum sizes in i and j in the inner loop, for scratch
   const int Ni = idx_sp.get_max_ni();
   const int Nj = idx_sp.get_max_nj();
   const int Nmax = Ni * Nj;

   // single scratch array for i,j
   auto scratch_size = ScratchPad1D<Real>::shmem_size(Nmax);
   constexpr int scratch_level = 0;

   // Par for
   par_for_outer(
       DEFAULT_OUTER_LOOP_PATTERN, "KernelOuter", DevExecSpace(),
       scratch_size, scratch_level, 0, nblocks - 1, 0,
       idx_sp.outer_size() - 1,
       KOKKOS_LAMBDA(team_mbr_t member, const int b, const int outer_idx) {
         ScratchPad1D<Real> scratch(member.team_scratch(scratch_level), Nmax);

         // Get index ranges. Note they depend on where we are in the
         // outer index! These give us a sense for where we are in k,j
         // space.
         const auto krange = idx_sp.GetBoundsK(outer_idx);
         const auto jrange = idx_sp.GetBoundsJ(outer_idx);
         // This is the loop of contiguous inner memory. May contain i
         // and j!
         const auto flattened_inner_ijrange = idx_sp.GetInnerBounds(jrange);
         const int inner_size =
             flattened_inner_ijrange.e - flattened_inner_ijrange.s + 1;

         // Whatever part of k is not in the outer loop can be looped
         // over with a normal for loop here
         for (int k = krange.s; k <= krange.e; ++k) {
           // Pull out a pointer to some variable in some pack. Note we
           // pick the 0th index of i at k and jrange.s
           Real *var = &pack(b, ivar, k, jrange.s, flattened_inner_ijrange.s);

           // Do something with the pointer in the inner loop.
           par_for_inner(DEFAULT_INNER_LOOP_PATTERN, member, 0, inner_size - 1,
                         [&](const int i) { foo(var[i]); });
         }
       });
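Regarding the face-centered-data warning above, a hedged sketch of
adjusting the ranges by hand before constructing the split might look
like the following. The ``GetBoundsK``/``GetBoundsJ``/``GetBoundsI``
helpers on ``MeshData`` and the single-point offset for
``X1``-face-centered data are assumptions here, not a guaranteed API.

.. code:: cpp

   // Assumption: md is a MeshData<Real> *, and the data of interest is
   // face-centered in X1, so the X1 range gains one extra point.
   IndexRange kb = md->GetBoundsK(IndexDomain::interior);
   IndexRange jb = md->GetBoundsJ(IndexDomain::interior);
   IndexRange ib = md->GetBoundsI(IndexDomain::interior);
   ib.e += 1; // one extra face in X1 (illustrative offset)

   IndexSplit idx_sp(md, kb, jb, ib, nkp, njp);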