Nested Parallelism
The loop wrappers documented here abstract a hierarchical parallelism model, which allows more fine-grained control over core-level vs. thread- and vector-level parallelism and allows explicitly defined caches for tightly nested loops. These wrappers provide a simplified interface to the hierarchical parallelism in Kokkos.
For an example of the nested parallel wrappers in use, see the unit test.
par_for_outer
par_for_outer abstracts the team or multicore parallelism for outer
loops. Inside the loop body, i.e., in the lambda function provided to
par_for_outer, synchronization and memory sharing between the
threads of a single team is possible through the Kokkos team member
handle (member_type) passed to the lambda.
The bytes of scratch memory (cache) needed by a single team are
specified via the scratch_size_in_bytes argument, which needs to be
computed using ScratchPadXD::shmem_size.
The scratch_level argument defines where the scratch memory is
allocated. On CUDA GPUs, scratch_level=0 allocates the cache in
the faster but smaller shared memory, and scratch_level=1
allocates the cache in the slower but larger global (on-device)
memory. On CPUs, scratch_level currently makes no difference.
Note that every thread within a team will execute code inside a
par_for_outer but outside of the par_for_inner.
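As a rough sketch of what par_for_outer abstracts, the following uses the underlying Kokkos TeamPolicy directly; the wrapper hides this boilerplate behind a single call. The function name and loop body are made up for illustration.

#include <Kokkos_Core.hpp>

void outer_loop_sketch(const int n_outer) {
  using team_policy = Kokkos::TeamPolicy<>;
  using member_type = team_policy::member_type;

  Kokkos::parallel_for(
      "outer loop over teams", team_policy(n_outer, Kokkos::AUTO),
      KOKKOS_LAMBDA(const member_type &member) {
        // One outer iteration per team; league_rank() identifies it.
        const int outer_idx = member.league_rank();
        // Every thread of the team executes this region, so work placed
        // here (outside any inner loop) runs once per thread, not once per team.
        (void)outer_idx;
      });
}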
par_for_inner
par_for_inner abstracts the thread- and vector-level parallelism of
compute units within a single team or core. Work defined through a
par_for_inner is distributed among the individual threads and
vector lanes within the team.
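As a minimal sketch of the level par_for_inner abstracts, the following distributes an inner index range with a raw Kokkos TeamVectorRange (the layout assumed here; see the CMake options below). The view name and loop body are made up for illustration.

#include <Kokkos_Core.hpp>

void inner_loop_sketch(Kokkos::View<double **> out, const int n_outer,
                       const int n_inner) {
  using team_policy = Kokkos::TeamPolicy<>;
  using member_type = team_policy::member_type;

  Kokkos::parallel_for(
      "outer", team_policy(n_outer, Kokkos::AUTO),
      KOKKOS_LAMBDA(const member_type &member) {
        const int j = member.league_rank();
        // The inner range is distributed over the team's threads and
        // vector lanes; this is what par_for_inner abstracts.
        Kokkos::parallel_for(Kokkos::TeamVectorRange(member, n_inner),
                             [&](const int i) { out(j, i) = j + i; });
      });
}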
ScratchPadXD
Data type for arrays placed in scratch pad (cache) memory. Use
ScratchPadXD::shmem_size, which is documented in the Kokkos
documentation, to determine the scratch pad memory needed before
kernel launch.
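As a short sketch of sizing scratch memory before launch, assuming ScratchPadXD is an alias for a Kokkos::View in the execution space's scratch memory space (the alias and the factor of two below are illustrative assumptions, not the actual definition):

#include <Kokkos_Core.hpp>

using exec_space = Kokkos::DefaultExecutionSpace;
// Stand-in for a 2D ScratchPadXD: an unmanaged view in scratch memory.
using ScratchPad2D =
    Kokkos::View<double **, exec_space::scratch_memory_space,
                 Kokkos::MemoryTraits<Kokkos::Unmanaged>>;

// Bytes a single team needs for two nvar x ni scratch pads; the result is
// what would be passed as scratch_size_in_bytes to par_for_outer.
size_t scratch_size_in_bytes(const int nvar, const int ni) {
  return 2 * ScratchPad2D::shmem_size(nvar, ni);
}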
Important usage hints
To ensure that the individual threads of a team are synchronized,
always call team_member.team_barrier(); after a par_for_inner
if the subsequent execution depends on the results of that
par_for_inner. This pertains, for example, to filling a
ScratchPadXD array in one par_for_inner and using the scratch
array in the next one; see the unit
test for sample usage.
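The fill-then-use pattern, sketched in raw Kokkos under the same assumptions as the earlier sketches (scratch view alias, TeamVectorRange inner loops); the stencil in the second loop is arbitrary and only serves to create a dependence on neighboring scratch entries.

#include <Kokkos_Core.hpp>

void fill_then_use_sketch(Kokkos::View<double **> out,
                          Kokkos::View<const double **> in, const int n_outer,
                          const int n_inner) {
  using exec_space = Kokkos::DefaultExecutionSpace;
  using team_policy = Kokkos::TeamPolicy<exec_space>;
  using member_type = team_policy::member_type;
  using ScratchPad1D =
      Kokkos::View<double *, exec_space::scratch_memory_space,
                   Kokkos::MemoryTraits<Kokkos::Unmanaged>>;

  const int scratch_level = 0; // fast/small memory on GPUs
  const size_t scratch_bytes = ScratchPad1D::shmem_size(n_inner);

  Kokkos::parallel_for(
      "fill then use scratch",
      team_policy(n_outer, Kokkos::AUTO)
          .set_scratch_size(scratch_level, Kokkos::PerTeam(scratch_bytes)),
      KOKKOS_LAMBDA(const member_type &member) {
        const int j = member.league_rank();
        ScratchPad1D scratch(member.team_scratch(scratch_level), n_inner);

        // First inner loop: fill the team's scratch pad.
        Kokkos::parallel_for(Kokkos::TeamVectorRange(member, n_inner),
                             [&](const int i) { scratch(i) = in(j, i); });

        // Without this barrier the second loop may read entries that
        // another thread has not written yet.
        member.team_barrier();

        // Second inner loop: use the now complete scratch pad.
        Kokkos::parallel_for(Kokkos::TeamVectorRange(member, n_inner),
                             [&](const int i) {
                               const int ip = (i + 1) % n_inner;
                               out(j, i) = 0.5 * (scratch(i) + scratch(ip));
                             });
      });
}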
In addition, entering a par_for_inner does not imply a
barrier, and not all threads of a team may even enter an inner parallel
region (e.g., if there is not enough work, i.e., too few indices, for all
team members). This can lead to unintended side effects when all team
members write to common variables; see this
code
for an example.
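One way to avoid this pitfall is to restrict per-team writes to a single thread with Kokkos::single; the sketch below uses made-up names and an arbitrary write purely for illustration.

#include <Kokkos_Core.hpp>

void single_write_sketch(Kokkos::View<double *> out, const int n_outer) {
  using team_policy = Kokkos::TeamPolicy<>;
  using member_type = team_policy::member_type;

  Kokkos::parallel_for(
      "guarded per-team write", team_policy(n_outer, Kokkos::AUTO),
      KOKKOS_LAMBDA(const member_type &member) {
        const int j = member.league_rank();
        // Executed by exactly one thread of the team instead of all of them.
        Kokkos::single(Kokkos::PerTeam(member), [&]() { out(j) = 2.0 * j; });
        // Make the write visible to all team members before they use it.
        member.team_barrier();
      });
}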
CMake Options
PAR_LOOP_INNER_LAYOUT controls how the inner loop is implemented.
PAR_LOOP_INNER_LAYOUT=TVR_INNER_LOOP uses the Kokkos
TeamVectorRange, which merges TeamThreadRange and
ThreadVectorRange into one loop, to distribute work among the threads
and vector lanes of a team.
PAR_LOOP_INNER_LAYOUT=TVR_INNER_LOOP is the only option supported
for CUDA since the Kokkos loops are required for parallelization on
GPUs.
PAR_LOOP_INNER_LAYOUT=SIMDFOR_INNER_LOOP uses a for loop with a
#pragma omp simd to vectorize the loop, which typically gives better
vectorization than PAR_LOOP_INNER_LAYOUT=TVR_INNER_LOOP on
CPUs and is therefore the default on CPUs.
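For illustration, a sketch of the loop form that the SIMDFOR_INNER_LOOP layout corresponds to: a plain for loop annotated with #pragma omp simd (contrast with the TeamVectorRange-based sketches above). The function and its body are made up.

// Plain inner loop with OpenMP SIMD annotation, as selected by
// PAR_LOOP_INNER_LAYOUT=SIMDFOR_INNER_LOOP.
inline void simdfor_inner_sketch(double *out, const double *in,
                                 const int n_inner) {
#pragma omp simd
  for (int i = 0; i < n_inner; ++i) {
    out[i] = 2.0 * in[i];
  }
}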