Nested Parallelism
The loop wrappers documented here abstract a hierarchical parallelism model, which allows finer-grained control over core-level versus thread- and vector-level parallelism and allows explicitly defined caches for tightly nested loops. These wrappers provide a simplified interface to the hierarchical parallelism in Kokkos.
For an example of the nested parallel wrappers in use, see the unit test.
par_for_outer
par_for_outer abstracts the team or multicore parallelism for outer loops. Inside the loop body, i.e., in the lambda function provided to par_for_outer, synchronization and memory sharing between the threads of a single team is possible through the member_type team member from Kokkos.
The number of bytes of scratch memory (cache) needed by a single team is specified via the scratch_size_in_bytes argument, which should be computed using ScratchPadXD::shmem_size.
The scratch_level argument defines where the scratch memory is allocated. For CUDA GPUs, scratch_level=0 allocates the cache in the faster but smaller shared memory, and scratch_level=1 allocates the cache in the slower but larger global memory (on-device RAM). For CPUs, scratch_level currently makes no difference.
Note that every thread within a team will execute code that is inside a par_for_outer but outside of a par_for_inner.
par_for_inner
par_for_inner abstracts the thread- and vector-level parallelism of compute units within a single team or core. Work defined through a par_for_inner will be distributed between individual threads and vector lanes within the team.
ScratchPadXD
Data type for memory in scratch pad/cache memory. Use ScratchPadXD::shmem_size, which is documented in the Kokkos documentation, to determine scratch pad memory needs before kernel launch.
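Putting the pieces above together, a minimal sketch of a nested launch with team scratch may look as follows (the kernel name, the array var, and the bounds ks, ke, js, je, is, ie, nx1 are illustrative, not part of the API):

using parthenon::par_for_outer;
using parthenon::par_for_inner;
using parthenon::ScratchPad1D;
using parthenon::team_mbr_t;

constexpr int scratch_level = 0; // faster but smaller memory on GPUs
// bytes of scratch needed by one team for a 1D pad of length nx1
const size_t scratch_size_in_bytes = ScratchPad1D<Real>::shmem_size(nx1);
par_for_outer(
    "NestedExample", scratch_size_in_bytes, scratch_level, ks, ke, js, je,
    KOKKOS_LAMBDA(team_mbr_t member, const int k, const int j) {
      // team-local cache allocated at the requested scratch level
      ScratchPad1D<Real> cache(member.team_scratch(scratch_level), nx1);
      par_for_inner(member, is, ie, [&](const int i) {
        cache(i - is) = var(k, j, i); // fill the cache cooperatively
      });
    });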
On Barriers
In order to ensure that the individual threads of a team are synchronized, always call team_member.team_barrier(); after a par_for_inner if the following execution depends on the results of that par_for_inner. This pertains, for example, to filling a ScratchPadXD array in one par_for_inner and using the scratch array in the next one; see the unit test for sample usage.
In addition, entry into a par_for_inner does not imply a barrier, and not all threads of a team may even enter an inner parallel region (e.g., if there is not enough work, i.e., too few indices, for all team members). This can lead to unintended side effects when all team members write to common variables; see this code for an example.
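As a minimal sketch of this pattern (the scratch and out arrays and the length N are illustrative), the barrier sits between the producing and consuming inner loops:

par_for_inner(member, 0, N - 1, [&](const int i) {
  scratch(i) = SomeExpensiveFunction(i); // fill the scratch array
});
member.team_barrier(); // make the writes visible to all threads of the team
par_for_inner(member, 0, N - 1, [&](const int i) {
  out(i) = scratch(i) + scratch((i + 1) % N); // now safe to read neighbors
});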
CMake Options
PAR_LOOP_INNER_LAYOUT controls how the inner loop is implemented.
PAR_LOOP_INNER_LAYOUT=TVR_INNER_LOOP uses the Kokkos TeamVectorRange, which merges TeamThreadRange and ThreadVectorRange into one loop, to distribute work between threads. PAR_LOOP_INNER_LAYOUT=TVR_INNER_LOOP is the only option supported for CUDA, since the Kokkos loops are required for parallelization on GPUs.
PAR_LOOP_INNER_LAYOUT=SIMDFOR_INNER_LOOP uses a for loop with a #pragma omp simd to vectorize the loop, which typically gives better vectorization than PAR_LOOP_INNER_LAYOUT=TVR_INNER_LOOP on CPUs and so is the default on CPUs.
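For example, the inner loop layout can be selected at configure time (the build directory here is illustrative):

cmake -S . -B build -DPAR_LOOP_INNER_LAYOUT=SIMDFOR_INNER_LOOP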
Performance Considerations
Hierarchical parallelism can produce very performant code, but a deeper awareness of how hardware is mapped to threads is required to get optimal performance. Here we list a few strategies/considerations.
On CPUs, with SIMDFOR_INNER_LOOP you may have trouble vectorizing unless you help the compiler along. One way to do so is to work with raw pointers to contiguous memory, rather than working with views and strides. Even for stencil ops, if you can pull out pointers that represent the different points on the stencil, this can help with vectorization, as sketched below.
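As an illustration of this idea (the function, the view v, and the output d2 are assumptions, not Parthenon API), one can hoist unit-stride pointers out of a view before a vectorized stencil loop:

template <typename View>
KOKKOS_INLINE_FUNCTION void SecondDiffX1(const View &v, const View &d2,
                                         const int k, const int j,
                                         const int is, const int ie) {
  // raw pointers to the three stencil points; i is contiguous in memory
  const Real *ql = &v(k, j, is - 1);
  const Real *qc = &v(k, j, is);
  const Real *qr = &v(k, j, is + 1);
  Real *out = &d2(k, j, is);
#pragma omp simd
  for (int i = 0; i <= ie - is; ++i) {
    out[i] = qr[i] - 2.0 * qc[i] + ql[i]; // second difference in x
  }
}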
Similarly on CPUs, due to the cost of starting up a vector op, vectorization will only be a performance win if there is enough work in the inner loop. A minimum of 16 points is required for the op to vectorize at all; experience shows, however, that at least 64 is really required to see big wins. One strategy for providing enough vector cells in the inner loop is to do a 1D SIMDFOR inner loop but combine the j and i indices by simply looping over the contiguous memory in a rasterized plane on a block, as sketched below.
On GPUs, the outer loop typically maps to blocks, while the inner maps to threads. To see good performance, you must provide enough work in the inner loop to fill (in CUDA terms) a streaming multiprocessor (SM, equivalent to a Compute Unit or CU on AMD GPUs) with multiple warps (wavefronts on AMD) to take advantage of pipelining, and enough work in the outer loop to create enough blocks to fill all SMs on the GPU, divided by the number of simultaneous streams. The number of warps in flight on the inner loop per SM (which is related to "occupancy") will depend positively on the length of the inner loop and negatively on higher shared memory usage (scratch pad memory in Kokkos parlance, Local Data Share or LDS on AMD GPUs) and higher register usage. Note that the number of SMs and the available shared memory and registers per SM vary between GPU architectures and especially between GPU vendors.
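A minimal sketch of the fused j-i strategy (the bounds and the array var are illustrative): flatten the loop, then decode j and i from the single inner index:

// assumes the i-index is contiguous in memory
const int Ni = ie - is + 1;
const int Nj = je - js + 1;
par_for_inner(member, 0, Nj * Ni - 1, [&](const int ji) {
  const int j = js + ji / Ni;
  const int i = is + ji % Ni;
  var(k, j, i) *= 2.0; // illustrative pointwise work
});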
IndexSplit
To balance the CPU vs. GPU hardware considerations of hierarchical parallelism, Parthenon provides a utility, the IndexSplit class, defined in the utils/index_split.hpp header file and available in <parthenon/package.hpp> in the parthenon::package::prelude namespace.
In our experience, IndexSplit is most beneficial when working with small meshblocks on CPUs, especially in two dimensions. For small blocks, we want vectorized operations over contiguous memory for our innermost loop, but we want that loop to contain enough work for, e.g., vector ops to function. We have often found that the optimal split is to fuse j and i into the inner loop and use k and blocks in the outer loop.
The IndexSplit class can be constructed as
IndexSplit(MeshData<Real> *md, IndexDomain domain, const int nkp, const int njp);
where md is a MeshData object on which you want to operate, and domain specifies where in the MeshBlock you wish to operate, for example IndexDomain::interior. nkp and njp are the numbers of points in X3 and X2, respectively, that are placed in the outer loop. All remaining points are in the inner loop; each team will iterate over multiple k and/or j indices to cover the specified k/j range. Typically the MeshBlock index in the pack is also assumed to be in the outer loop. nkp and njp also accept the special flags IndexSplit::all_outer and IndexSplit::no_outer, which specify that all or none of the indices in that direction should be in the outer loop.
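For example (the variable names are illustrative), to place all of X3 but none of X2 in the outer loop:

IndexSplit idx_sp(md, IndexDomain::interior, IndexSplit::all_outer, IndexSplit::no_outer);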
Warning
Note that, in contrast to njp, nkp points in the k-direction are not included in the innermost loop bounds. You must loop over k by hand inside the outer loop body.
A second constructor alternatively sets the ranges for X3, X2, and X1 explicitly:
IndexSplit(MeshData<Real> *md, const IndexRange &kb, const IndexRange &jb,
const IndexRange &ib, const int nkp, const int njp);
where kb, jb, and ib specify the starting and ending indices for X3, X2, and X1, respectively.
Warning
Note that, at this time, IndexSplit doesn't know about face-centered or edge-centered data. To use IndexSplit with, e.g., face-centered data, set the input IndexRange quantities to match the shape of the face-centered data (e.g., with the appropriate offsets), as sketched below.
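A minimal sketch for X1 face-centered data (the bounds accessors used here are an assumption):

// start from the cell-centered interior bounds of the MeshData object
IndexRange kb = md->GetBoundsK(IndexDomain::interior);
IndexRange jb = md->GetBoundsJ(IndexDomain::interior);
IndexRange ib = md->GetBoundsI(IndexDomain::interior);
ib.e += 1; // X1 faces carry one extra point in the i-direction
IndexSplit idx_sp(md, kb, jb, ib, nkp, njp);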
An IndexSplit object is typically used as:
using namespace parthenon::package::prelude;
using parthenon::ScratchPad1D;
using parthenon::IndexSplit;
using parthenon::par_for_outer;
using parthenon::par_for_inner;
using parthenon::team_mbr_t;
// Initialize index split object
IndexSplit idx_sp(md, IndexDomain::interior, nkp, njp);
// Request maximum size in i and j in the inner loop, for scratch
const int Ni = idx_sp.get_max_ni();
const int Nj = idx_sp.get_max_nj();
const int Nmax = Ni * Nj;
// single scratch array for i,j
auto scratch_size = ScratchPad1D<Real>::shmem_size(Nmax);
constexpr int scratch_level = 0;
// Par for
par_for_outer(
    "KernelOuter", scratch_size, scratch_level, 0, nblocks - 1, 0,
    idx_sp.outer_size() - 1,
    KOKKOS_LAMBDA(team_mbr_t member, const int b, const int outer_idx) {
      ScratchPad1D<Real> scratch(member.team_scratch(scratch_level), Nmax);
      // Get index ranges. Note they depend on where we are in the outer index!
      // These give us a sense for where we are in k,j space
      const auto krange = idx_sp.GetBoundsK(outer_idx);
      const auto jrange = idx_sp.GetBoundsJ(outer_idx);
      // This is the loop of contiguous inner memory. May contain i and j!
      const auto flattened_inner_ijrange = idx_sp.GetInnerBounds(jrange);
      const int inner_size = flattened_inner_ijrange.e - flattened_inner_ijrange.s + 1;
      // Whatever part of k is not in the outer loop can be looped over
      // with a normal for loop here
      for (int k = krange.s; k <= krange.e; ++k) {
        // pull out a pointer to some variable in some pack. Note
        // we pick the 0th index of i at k and jrange.s
        Real *var = &pack(b, ivar, k, jrange.s, flattened_inner_ijrange.s);
        // Do something with the pointer in the inner loop.
        par_for_inner(member, 0, inner_size - 1, [&](const int i) {
          foo(var[i]);
        });
      }
    });